initial commit

.gitattributes (vendored, new file)
@@ -0,0 +1,3 @@
```
assets/onnx/*.onnx filter=lfs diff=lfs merge=lfs -text
ios/** linguist-ignore
web/** linguist-ignore
```
.gitignore (vendored, new file)
@@ -0,0 +1,62 @@
```
assets/*
assets/.git
assets/.gitignore
assets/.gitattributes

*.onnx
onnx

# Output files
results

# Python
__pycache__
*.py[cod]
*$py.class
*.so
.Python

# Virtual environments
.venv
venv/
ENV/
env/

# Node.js
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*
package-lock.json

# Swift
.build/
.swiftpm/
*.xcodeproj
*.xcworkspace
xcuserdata/
DerivedData/

# Distribution / packaging
build/
dist/
*.egg-info/
.eggs/

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db
assets
```
LICENSE (new file)
@@ -0,0 +1,21 @@

MIT License

Copyright (c) 2025 Supertone Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md (new file)
@@ -0,0 +1,452 @@

# Supertonic — Lightning Fast, On-Device TTS

[Demo (Supertonic 2)](https://huggingface.co/spaces/Supertone/supertonic-2)
[Models (Supertonic 2)](https://huggingface.co/Supertone/supertonic-2)
[Demo (Supertonic 1)](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo)
[Models (Supertonic 1)](https://huggingface.co/Supertone/supertonic)

<p align="center">
  <img src="img/supertonic_preview_0.1.jpg" alt="Supertonic Banner">
</p>

**Supertonic** is a lightning-fast, on-device text-to-speech system designed for **extreme performance** with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.

### 📰 Update News

- **2026.01.22** - **[Voice Builder](https://supertonic.supertone.ai/voice_builder)** is now live! Turn your voice into a deployable, edge-native TTS with permanent ownership.

<p align="center">
  <img src="img/voicebuilder_img.png" alt="Voice Builder" width="600">
</p>

- **2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
- **2025.12.10** - Added the `supertonic` PyPI package! Install via `pip install supertonic`. For details, visit the [supertonic-py documentation](https://supertone-inc.github.io/supertonic-py)
- **2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
- **2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
- **2025.11.24** - Added Flutter SDK support with macOS compatibility

### Table of Contents

- [Demo](#demo)
- [Why Supertonic?](#why-supertonic)
- [Language Support](#language-support)
- [Getting Started](#getting-started)
- [Performance](#performance)
- [Built with Supertonic](#built-with-supertonic)
- [Citation](#citation)
- [License](#license)

## Demo

### Raspberry Pi

Watch Supertonic running on a **Raspberry Pi**, demonstrating on-device, real-time text-to-speech synthesis:

https://github.com/user-attachments/assets/ea66f6d6-7bc5-4308-8a88-1ce3e07400d2

### E-Reader

Experience Supertonic on an **Onyx Boox Go 6** e-reader in airplane mode, achieving an average RTF of 0.3× with zero network dependency:

https://github.com/user-attachments/assets/64980e58-ad91-423a-9623-78c2ffc13680

### Chrome Extension

The Chrome extension turns any webpage into audio in under one second, delivering lightning-fast, on-device text-to-speech with zero network dependency—free, private, and effortless:

https://github.com/user-attachments/assets/cc8a45fc-5c3e-4b2c-8439-a14c3d00d91c

---

> 🎧 **Try it now**: Experience Supertonic in your browser with our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic-2), or get started with pre-trained models from [**Hugging Face Hub**](https://huggingface.co/Supertone/supertonic-2)

## Why Supertonic?

- **⚡ Blazingly Fast**: Generates speech up to **167× faster than real-time** on consumer hardware (M4 Pro)—unmatched by any other TTS system
- **🪶 Ultra Lightweight**: Only **66M parameters**, optimized for efficient on-device performance with minimal footprint
- **📱 On-Device Capable**: **Complete privacy** and **zero latency**—all processing happens locally on your device
- **🎨 Natural Text Handling**: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
- **⚙️ Highly Configurable**: Adjust inference steps, batch processing, and other parameters to match your specific needs
- **🧩 Flexible Deployment**: Deploy seamlessly across servers, browsers, and edge devices with multiple runtime backends

## Language Support

We provide ready-to-use TTS inference examples across multiple ecosystems:

| Language/Platform | Path | Description |
|-------------------|------|-------------|
| [**Python**](py/) | `py/` | ONNX Runtime inference |
| [**Node.js**](nodejs/) | `nodejs/` | Server-side JavaScript |
| [**Browser**](web/) | `web/` | WebGPU/WASM inference |
| [**Java**](java/) | `java/` | Cross-platform JVM |
| [**C++**](cpp/) | `cpp/` | High-performance C++ |
| [**C#**](csharp/) | `csharp/` | .NET ecosystem |
| [**Go**](go/) | `go/` | Go implementation |
| [**Swift**](swift/) | `swift/` | macOS applications |
| [**iOS**](ios/) | `ios/` | Native iOS apps |
| [**Rust**](rust/) | `rust/` | Memory-safe systems |
| [**Flutter**](flutter/) | `flutter/` | Cross-platform apps |

> For detailed usage instructions, please refer to the README.md in each language directory.

## Getting Started

First, clone the repository:

```bash
git clone https://github.com/supertone-inc/supertonic.git
cd supertonic
```

### Prerequisites

Before running the examples, download the ONNX models and preset voices, and place them in the `assets` directory:

> **Note:** The Hugging Face repository uses Git LFS. Please ensure Git LFS is installed and initialized before cloning or pulling large model files.
> - macOS: `brew install git-lfs && git lfs install`
> - Generic: see `https://git-lfs.com` for installers

```bash
git clone https://huggingface.co/Supertone/supertonic-2 assets
```
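
If the clone finishes but Git LFS is missing, the `.onnx` files under `assets/onnx/` will be tiny text pointer stubs (roughly 130 bytes) rather than real models. A minimal sanity check, sketched in Python (the `assets/onnx/` layout matches the `.gitattributes` LFS rule, but exact filenames vary between releases):

```python
from pathlib import Path

def check_assets(assets_dir: str = "assets") -> list:
    """Return .onnx files that look like Git LFS pointer stubs.

    LFS pointer files are a few lines of text, while real ONNX models are
    megabytes, so a simple size threshold separates the two reliably.
    """
    suspicious = []
    for onnx_file in Path(assets_dir).glob("onnx/*.onnx"):
        if onnx_file.stat().st_size < 1024:  # pointer stub, not a model
            suspicious.append(onnx_file)
    return suspicious

if __name__ == "__main__":
    stubs = check_assets()
    if stubs:
        print("Run 'git lfs pull' inside assets/ to fetch:", stubs)
    else:
        print("All ONNX files look like real model binaries.")
```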

### Quick Start

**Python Example** ([Details](py/))
```bash
cd py
uv sync
uv run example_onnx.py
```

**Node.js Example** ([Details](nodejs/))
```bash
cd nodejs
npm install
npm start
```

**Browser Example** ([Details](web/))
```bash
cd web
npm install
npm run dev
```

**Java Example** ([Details](java/))
```bash
cd java
mvn clean install
mvn exec:java
```

**C++ Example** ([Details](cpp/))
```bash
cd cpp
mkdir build && cd build
cmake .. && cmake --build . --config Release
./example_onnx
```

**C# Example** ([Details](csharp/))
```bash
cd csharp
dotnet restore
dotnet run
```

**Go Example** ([Details](go/))
```bash
cd go
go mod download
go run example_onnx.go helper.go
```

**Swift Example** ([Details](swift/))
```bash
cd swift
swift build -c release
.build/release/example_onnx
```

**Rust Example** ([Details](rust/))
```bash
cd rust
cargo build --release
./target/release/example_onnx
```

**iOS Example** ([Details](ios/))
```bash
cd ios/ExampleiOSApp
xcodegen generate
open ExampleiOSApp.xcodeproj
```
- In Xcode: Targets → ExampleiOSApp → Signing: select your Team
- Choose your iPhone as run destination → Build & Run

### Technical Details

- **Runtime**: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode is not tested)
- **Browser Support**: onnxruntime-web for client-side inference
- **Batch Processing**: Supports batch inference for improved throughput
- **Audio Output**: Outputs 16-bit WAV files
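
For reference, 16-bit WAV output of the kind the examples produce can be written with only the Python standard library. This is an illustrative sketch, not the repository's actual writer, and the 44.1 kHz sample rate here is an assumption:

```python
import struct
import wave

def write_wav_16bit(path: str, samples: list, sample_rate: int = 44100) -> None:
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit PCM = 2 bytes per sample
        wf.setframerate(sample_rate)
        # Clamp each float and scale it to a signed 16-bit integer.
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)
```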

## Performance

We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).

**Metrics:**
- **Characters per Second**: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
- **Real-time Factor (RTF)**: Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., an RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
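
Both metrics are simple ratios of time and length; sketched in Python for concreteness (function names are illustrative, not part of any Supertonic API):

```python
def chars_per_second(num_chars: int, synthesis_seconds: float) -> float:
    """Throughput: input characters divided by wall-clock synthesis time."""
    return num_chars / synthesis_seconds

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: synthesis time relative to the duration of the audio produced.

    Values below 1.0 mean faster than real time; e.g. taking 0.12 s to
    synthesize 10 s of audio gives an RTF of 0.012.
    """
    return synthesis_seconds / audio_seconds
```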

### Characters per Second

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 912 | 1048 | 1263 |
| **Supertonic** (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
| **Supertonic** (RTX 4090) | 2615 | 6548 | 12164 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 144 | 209 | 287 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 37 | 55 | 82 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 12 | 18 | 24 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 38 | 64 | 92 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 104 | 107 | 117 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 37 | 42 | 47 |

> **Notes:**
> `API` = Cloud-based API services (measured from Seoul)
> `Open` = Open-source models
> Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): Tested with ONNX
> Supertonic (RTX 4090): Tested with PyTorch model
> Kokoro: Tested on M4 Pro CPU with ONNX
> NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF

### Real-time Factor

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
| **Supertonic** (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| **Supertonic** (RTX 4090) | 0.005 | 0.002 | 0.001 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 0.133 | 0.077 | 0.057 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 0.471 | 0.302 | 0.201 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 1.060 | 0.673 | 0.541 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 0.372 | 0.206 | 0.163 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 0.144 | 0.124 | 0.126 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 0.390 | 0.338 | 0.343 |

<details>
<summary><b>Additional Performance Data (5-step inference)</b></summary>

<br>

**Characters per Second (5-step)**

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 596 | 691 | 850 |
| **Supertonic** (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
| **Supertonic** (RTX 4090) | 1286 | 3757 | 6242 |

**Real-time Factor (5-step)**

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
| **Supertonic** (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| **Supertonic** (RTX 4090) | 0.011 | 0.004 | 0.002 |

</details>

### Natural Text Handling

Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.

> 🎧 **Browse audio samples more easily**: Check out our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#text-handling) for a more convenient way to listen to all of the audio examples

**Overview of Test Cases:**

| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|:--------:|:--------------:|:----------:|:----------:|:------:|:------:|:---------:|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |
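
Supertonic's internal normalizer is not published in this README. Purely as an illustration of what one of these test cases demands, here is a toy normalizer for abbreviated currency magnitudes; the wording rules are assumptions, not the shipped behavior:

```python
import re

DIGITS = "zero one two three four five six seven eight nine".split()
MAGNITUDES = {"K": "thousand", "M": "million", "B": "billion"}

def spell_digits(number: str) -> str:
    """Read a number digit by digit, speaking '.' as 'point'.

    A real normalizer would verbalize multi-digit integers properly
    ('450' -> 'four hundred fifty'); digit-by-digit keeps the toy short.
    """
    return " ".join("point" if ch == "." else DIGITS[int(ch)] for ch in number)

def normalize_currency(text: str) -> str:
    """Expand patterns like '$5.2M' into 'five point two million dollars'."""
    def repl(m):
        words = spell_digits(m.group("num"))
        suffix = MAGNITUDES.get(m.group("mag"), "")
        return " ".join(filter(None, [words, suffix, "dollars"]))
    return re.sub(r"\$(?P<num>\d+(?:\.\d+)?)(?P<mag>[KMB])?", repl, text)
```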

<details>
<summary><b>Example 1: Financial Expression</b></summary>

<br>

**Text:**
> "The startup secured **$5.2M** in venture capital, a huge leap from their initial **$450K** seed round."

**Challenges:**
- Decimal point in currency ($5.2M should be read as "five point two million")
- Abbreviated magnitude units (M for million, K for thousand)
- Currency symbol ($) that needs to be properly pronounced as "dollars"

**Audio Samples:**

| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1eancUOhiSXCVoTu9ddh4S-OcVQaWrPV-/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1-r2scv7XQ1crIDu6QOh3eqVl445W6ap_/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1MFDXMjfmsAVOqwPx7iveS0KUJtZvcwxB/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1dEHpNzfMUucFTJPQK0k4RcFZvPwQTt09/view?usp=sharing) |
| VibeVoice Realtime 0.5B | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1b69XWBQnSZZ0WZeR3avv7E8mSdoN6p6P/view?usp=sharing) |

</details>

<details>
<summary><b>Example 2: Time and Date</b></summary>

<br>

**Text:**
> "The train delay was announced at **4:45 PM** on **Wed, Apr 3, 2024** due to track maintenance."

**Challenges:**
- Time expression with PM notation (4:45 PM)
- Abbreviated weekday (Wed)
- Abbreviated month (Apr)
- Full date format (Apr 3, 2024)

**Audio Samples:**

| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1ehkZU8eiizBenG2DgR5tzBGQBvHS0Uaj/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1ta3r6jFyebmA-sT44l8EaEQcMLVmuOEr/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1sskmem9AzHAQ3Hv8DRSZoqX_pye-CXuU/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1zx9X8oMsLMXW0Zx_SURoqjju-By2yh_n/view?usp=sharing) |
| VibeVoice Realtime 0.5B | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1ZpGEstZr4hA0EdAWBMCUFFWuAkIpYsVh/view?usp=sharing) |

</details>

<details>
<summary><b>Example 3: Phone Number</b></summary>

<br>

**Text:**
> "You can reach the hotel front desk at **(212) 555-0142 ext. 402** anytime."

**Challenges:**
- Area code in parentheses that should be read as separate digits
- Phone number with hyphen separator (555-0142)
- Abbreviated extension notation (ext.)
- Extension number (402)

**Audio Samples:**

| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1z-e5iTsihryMR8ll1-N1YXkB2CIJYJ6F/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1HAzVXFTZfZm0VEK2laSpsMTxzufcuaxA/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/15tjfAmb3GbjP_kmvD7zSdIWkhtAaCPOg/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1BCL8n7yligUZyso970ud7Gf5NWb1OhKD/view?usp=sharing) |
| VibeVoice Realtime 0.5B | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1c0c0YM_Qm7XxSk2uSVYLbITgEDTqaVzL/view?usp=sharing) |

</details>

<details>
<summary><b>Example 4: Technical Unit</b></summary>

<br>

**Text:**
> "Our drone battery lasts **2.3h** when flying at **30kph** with full camera payload."

**Challenges:**
- Decimal time duration with abbreviation (2.3h = two point three hours)
- Speed unit with abbreviation (30kph = thirty kilometers per hour)
- Technical abbreviations (h for hours, kph for kilometers per hour)
- Technical/engineering context requiring proper pronunciation

**Audio Samples:**

| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1kvOBvswFkLfmr8hGplH0V2XiMxy1shYf/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1_SzfjWJe5YEd0t3R7DztkYhHcI_av48p/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1P5BSilj5xFPTV2Xz6yW5jitKZohO9o-6/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1GU82SnWC50OvC8CZNjhxvNZFKQb7I9_Y/view?usp=sharing) |
| VibeVoice Realtime 0.5B | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1lUTrxrAQy_viEK2Hlu3KLLtTCe8jvbdV/view?usp=sharing) |

</details>

> **Note:** These samples demonstrate how each system handles text normalization and pronunciation of complex expressions **without requiring pre-processing or phonetic annotations**.

## Built with Supertonic

| Project | Description | Links |
|---------|-------------|-------|
| **TLDRL** | Free, on-device TTS extension for reading any webpage | [Chrome](https://chromewebstore.google.com/detail/tldrl-lightning-tts-power/mdbiaajonlkomihpcaffhkagodbcgbme) |
| **Read Aloud** | Open-source TTS browser extension | [Chrome](https://chromewebstore.google.com/detail/read-aloud-a-text-to-spee/hdhinadidafjejdhmfkjgnolgimiaplp) · [Edge](https://microsoftedge.microsoft.com/addons/detail/read-aloud-a-text-to-spe/pnfonnnmfjnpfgagnklfaccicnnjcdkm) · [GitHub](https://github.com/ken107/read-aloud) |
| **PageEcho** | E-Book reader app for iOS | [App Store](https://apps.apple.com/us/app/pageecho/id6755965837) |
| **VoiceChat** | On-device voice-to-voice LLM chatbot in the browser | [Demo](https://huggingface.co/spaces/RickRossTN/ai-voice-chat) · [GitHub](https://github.com/irelate-ai/voice-chat) |
| **OmniAvatar** | Talking avatar video generator from photo + speech | [Demo](https://huggingface.co/spaces/alexnasa/OmniAvatar) |
| **CopiloTTS** | Kotlin Multiplatform TTS SDK via ONNX Runtime | [GitHub](https://github.com/sigmadeltasoftware/CopiloTTS) |
| **Voice Mixer** | PyQt5 tool for mixing and modifying voice styles | [GitHub](https://github.com/Topping1/Supertonic-Voice-Mixer) |
| **Supertonic MNN** | Lightweight library based on MNN (fp32/fp16/int8) | [GitHub](https://github.com/vra/supertonic-mnn) · [PyPI](https://pypi.org/project/supertonic-mnn/) |
| **Transformers.js** | Hugging Face's JS library with Supertonic support | [GitHub PR](https://github.com/huggingface/transformers.js/pull/1459) · [Demo](https://huggingface.co/spaces/webml-community/Supertonic-TTS-WebGPU) |
| **Pinokio** | 1-click localhost cloud for Mac, Windows, and Linux | [Pinokio](https://pinokio.co/) · [GitHub](https://github.com/SUP3RMASS1VE/SuperTonic-TTS) |

## Citation

The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:

### SupertonicTTS: Main Architecture

This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.

```bibtex
@article{kim2025supertonic,
  title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
  author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
  journal={arXiv preprint arXiv:2503.23108},
  year={2025},
  url={https://arxiv.org/abs/2503.23108}
}
```

### Length-Aware RoPE: Text-Speech Alignment

This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.

```bibtex
@article{kim2025larope,
  title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
  author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
  journal={arXiv preprint arXiv:2509.11084},
  year={2025},
  url={https://arxiv.org/abs/2509.11084}
}
```

### Self-Purifying Flow Matching: Training with Noisy Labels

This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.

```bibtex
@article{kim2025spfm,
  title={Training Flow Matching Models with Reliable Labels via Self-Purification},
  author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
  journal={arXiv preprint arXiv:2509.19091},
  year={2025},
  url={https://arxiv.org/abs/2509.19091}
}
```

## License

This project's sample code is released under the MIT License; see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) file for details.

The accompanying model is released under the OpenRAIL-M License; see the [LICENSE](https://huggingface.co/Supertone/supertonic-2/blob/main/LICENSE) file for details.

The model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.

Copyright (c) 2026 Supertone Inc.
cpp/CMakeLists.txt (new file)
@@ -0,0 +1,122 @@
```cmake
cmake_minimum_required(VERSION 3.15)
project(Supertonic_CPP)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Enable aggressive optimization
if(NOT CMAKE_BUILD_TYPE)
    set(CMAKE_BUILD_TYPE Release)
endif()

# Add optimization flags
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -DNDEBUG -ffast-math")
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -O3 -DNDEBUG -ffast-math")

# Find required packages
find_package(PkgConfig REQUIRED)
find_package(OpenMP)

# ONNX Runtime - Try multiple methods
# Method 1: Try to find via CMake config
find_package(onnxruntime QUIET CONFIG)

if(NOT onnxruntime_FOUND)
    # Method 2: Try pkg-config
    pkg_check_modules(ONNXRUNTIME QUIET libonnxruntime)

    if(ONNXRUNTIME_FOUND)
        set(ONNXRUNTIME_INCLUDE_DIR ${ONNXRUNTIME_INCLUDE_DIRS})
        set(ONNXRUNTIME_LIB ${ONNXRUNTIME_LIBRARIES})
    else()
        # Method 3: Manual search in common locations
        find_path(ONNXRUNTIME_INCLUDE_DIR
            NAMES onnxruntime_cxx_api.h
            PATHS
                /usr/local/include
                /opt/homebrew/include
                /usr/include
                ${CMAKE_PREFIX_PATH}/include
            PATH_SUFFIXES onnxruntime
        )

        find_library(ONNXRUNTIME_LIB
            NAMES onnxruntime libonnxruntime
            PATHS
                /usr/local/lib
                /opt/homebrew/lib
                /usr/lib
                ${CMAKE_PREFIX_PATH}/lib
        )
    endif()

    if(NOT ONNXRUNTIME_INCLUDE_DIR OR NOT ONNXRUNTIME_LIB)
        message(FATAL_ERROR "ONNX Runtime not found. Please install it:\n"
                            "  macOS: brew install onnxruntime\n"
                            "  Ubuntu: See README.md for installation instructions")
    endif()

    message(STATUS "Found ONNX Runtime:")
    message(STATUS "  Include: ${ONNXRUNTIME_INCLUDE_DIR}")
    message(STATUS "  Library: ${ONNXRUNTIME_LIB}")
endif()

# nlohmann/json
find_package(nlohmann_json REQUIRED)

# Include directories
if(NOT onnxruntime_FOUND)
    include_directories(${ONNXRUNTIME_INCLUDE_DIR})
endif()

# Helper library
add_library(tts_helper STATIC
    helper.cpp
    helper.h
)

if(onnxruntime_FOUND)
    target_link_libraries(tts_helper
        onnxruntime::onnxruntime
        nlohmann_json::nlohmann_json
    )
else()
    target_include_directories(tts_helper PUBLIC ${ONNXRUNTIME_INCLUDE_DIR})
    target_link_libraries(tts_helper
        ${ONNXRUNTIME_LIB}
        nlohmann_json::nlohmann_json
    )
endif()

# Enable OpenMP if available
if(OpenMP_CXX_FOUND)
    target_link_libraries(tts_helper OpenMP::OpenMP_CXX)
    message(STATUS "OpenMP enabled for parallel processing")
else()
    message(WARNING "OpenMP not found - parallel processing will be disabled")
endif()

# Example executable
add_executable(example_onnx
    example_onnx.cpp
)

if(onnxruntime_FOUND)
    target_link_libraries(example_onnx
        tts_helper
        onnxruntime::onnxruntime
        nlohmann_json::nlohmann_json
    )
else()
    target_link_libraries(example_onnx
        tts_helper
        ${ONNXRUNTIME_LIB}
        nlohmann_json::nlohmann_json
    )
endif()

# Installation
install(TARGETS example_onnx DESTINATION bin)
install(TARGETS tts_helper DESTINATION lib)
install(FILES helper.h DESTINATION include)
```
cpp/README.md (new file)
@@ -0,0 +1,139 @@

# Supertonic C++ Implementation

High-performance text-to-speech inference using ONNX Runtime.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)

**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.

**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).

**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.
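
The chunking behavior can be pictured with a short sketch (illustrative only; the actual splitting logic lives in the C++ helper, and the 200-character limit here is an assumption, not the implementation's threshold):

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list:
    """Split long text into chunks at sentence boundaries.

    Sentences are grouped greedily so each chunk stays under max_chars;
    a single sentence longer than the limit becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```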
|
||||
|
||||
## Requirements
|
||||
|
||||
- C++17 compiler, CMake 3.15+
|
||||
- Libraries: ONNX Runtime, nlohmann/json
|
||||
|
||||
## Installation

**Ubuntu/Debian:**

> ⚠️ **Note:** Installation instructions not yet verified.

```bash
sudo apt-get install -y cmake g++ nlohmann-json3-dev
wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.3/onnxruntime-linux-x64-1.16.3.tgz
tar -xzf onnxruntime-linux-x64-1.16.3.tgz
sudo cp -r onnxruntime-linux-x64-1.16.3/include/* /usr/local/include/
sudo cp -r onnxruntime-linux-x64-1.16.3/lib/* /usr/local/lib/
sudo ldconfig
```

**macOS:**

```bash
brew install cmake nlohmann-json onnxruntime
```

**Windows (vcpkg):**

> ⚠️ **Note:** Installation instructions not yet verified.

```powershell
vcpkg install nlohmann-json:x64-windows onnxruntime:x64-windows
vcpkg integrate install
```
## Building

```bash
cd cpp && mkdir build && cd build
cmake .. && cmake --build . --config Release
./example_onnx
```
## Basic Usage

### Example 1: Default Inference

Run inference with default settings:

```bash
./example_onnx
```

This will use:

- Voice style: `../assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference

Process multiple voice styles and texts at once:

```bash
./example_onnx \
    --voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json \
    --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
    --lang en,ko \
    --batch
```

This will:

- Use the `--batch` flag to enable batch processing mode
- Generate speech for 2 different voice-text pairs
- Use the male voice style (M1.json) for the first, English text
- Use the female voice style (F1.json) for the second, Korean text
- Process both samples in a single batch (automatic text chunking disabled)
### Example 3: High Quality Inference

Increase denoising steps for better quality:

```bash
./example_onnx \
    --total-step 10 \
    --voice-style ../assets/voice_styles/M1.json \
    --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:

- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
### Example 4: Long-Form Inference

For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:

```bash
./example_onnx \
    --voice-style ../assets/voice_styles/M1.json \
    --text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
```

This will:

- Automatically split the long text into smaller chunks (max 300 characters by default)
- Process each chunk separately while maintaining natural speech flow
- Insert brief silences (0.3 seconds) between chunks for natural pacing
- Combine all chunks into a single output audio file

**Note**: When using batch mode (`--batch`), automatic text chunking is disabled. Use non-batch mode for long-form text synthesis.
## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str | `../assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated for batch) |
| `--text` | str | (long default text) | Text(s) to synthesize (pipe-separated for batch) |
| `--lang` | str | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr` (comma-separated for batch) |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
## Notes

- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Long-Form Inference**: Without the `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
121
cpp/example_onnx.cpp
Normal file
@@ -0,0 +1,121 @@
#include "helper.h"
#include <iostream>
#include <filesystem>
#include <algorithm>
#include <string>
#include <vector>

namespace fs = std::filesystem;

struct Args {
    std::string onnx_dir = "../assets/onnx";
    int total_step = 5;
    float speed = 1.05f;
    int n_test = 4;
    std::vector<std::string> voice_style = {"../assets/voice_styles/M1.json"};
    std::vector<std::string> text = {
        "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
    };
    std::vector<std::string> lang = {"en"};
    std::string save_dir = "results";
    bool batch = false;
};

auto splitString = [](const std::string& str, char delim) {
    std::vector<std::string> result;
    size_t start = 0, pos;
    while ((pos = str.find(delim, start)) != std::string::npos) {
        result.push_back(str.substr(start, pos - start));
        start = pos + 1;
    }
    result.push_back(str.substr(start));
    return result;
};

Args parseArgs(int argc, char* argv[]) {
    Args args;
    for (int i = 1; i < argc; i++) {
        std::string arg = argv[i];
        if (arg == "--onnx-dir" && i + 1 < argc) args.onnx_dir = argv[++i];
        else if (arg == "--total-step" && i + 1 < argc) args.total_step = std::stoi(argv[++i]);
        else if (arg == "--speed" && i + 1 < argc) args.speed = std::stof(argv[++i]);
        else if (arg == "--n-test" && i + 1 < argc) args.n_test = std::stoi(argv[++i]);
        else if (arg == "--voice-style" && i + 1 < argc) args.voice_style = splitString(argv[++i], ',');
        else if (arg == "--text" && i + 1 < argc) args.text = splitString(argv[++i], '|');
        else if (arg == "--lang" && i + 1 < argc) args.lang = splitString(argv[++i], ',');
        else if (arg == "--save-dir" && i + 1 < argc) args.save_dir = argv[++i];
        else if (arg == "--batch") args.batch = true;
    }
    return args;
}

int main(int argc, char* argv[]) {
    std::cout << "=== TTS Inference with ONNX Runtime (C++) ===\n\n";

    // --- 1. Parse arguments --- //
    Args args = parseArgs(argc, argv);
    int total_step = args.total_step;
    float speed = args.speed;
    int n_test = args.n_test;
    std::string save_dir = args.save_dir;
    std::vector<std::string> voice_style_paths = args.voice_style;
    std::vector<std::string> text_list = args.text;
    std::vector<std::string> lang_list = args.lang;
    bool batch = args.batch;

    if (voice_style_paths.size() != text_list.size()) {
        std::cerr << "Error: Number of voice styles (" << voice_style_paths.size()
                  << ") must match number of texts (" << text_list.size() << ")\n";
        return 1;
    }
    int bsz = voice_style_paths.size();

    // --- 2. Load Text to Speech --- //
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "TTS");
    Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(
        OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault
    );

    auto text_to_speech = loadTextToSpeech(env, args.onnx_dir, false);
    std::cout << std::endl;

    // --- 3. Load Voice Style --- //
    auto style = loadVoiceStyle(voice_style_paths, true);

    // --- 4. Synthesize speech --- //
    fs::create_directories(save_dir);

    for (int n = 0; n < n_test; n++) {
        std::cout << "\n[" << (n + 1) << "/" << n_test << "] Starting synthesis...\n";

        auto result = timer("Generating speech from text", [&]() {
            if (batch) {
                return text_to_speech->batch(memory_info, text_list, lang_list, style, total_step, speed);
            } else {
                return text_to_speech->call(memory_info, text_list[0], lang_list[0], style, total_step, speed);
            }
        });

        int sample_rate = text_to_speech->getSampleRate();
        int wav_shape_1 = result.wav.size() / bsz;

        for (int b = 0; b < bsz; b++) {
            std::string fname = sanitizeFilename(text_list[b], 20) + "_" + std::to_string(n + 1) + ".wav";
            int wav_len = static_cast<int>(sample_rate * result.duration[b]);

            std::vector<float> wav_out(
                result.wav.begin() + b * wav_shape_1,
                result.wav.begin() + b * wav_shape_1 + wav_len
            );

            std::string output_path = save_dir + "/" + fname;
            writeWavFile(output_path, wav_out, sample_rate);
            std::cout << "Saved: " << output_path << "\n";
        }

        clearTensorBuffers();
    }

    std::cout << "\n=== Synthesis completed successfully! ===\n";
    return 0;
}
1186
cpp/helper.cpp
Normal file
229
cpp/helper.h
Normal file
@@ -0,0 +1,229 @@
#pragma once

#include <string>
#include <vector>
#include <memory>
#include <iostream>
#include <iomanip>
#include <chrono>
#include <onnxruntime_cxx_api.h>

// Available languages for multilingual TTS
extern const std::vector<std::string> AVAILABLE_LANGS;

/**
 * Configuration structure
 */
struct Config {
    struct AEConfig {
        int sample_rate;
        int base_chunk_size;
    } ae;

    struct TTLConfig {
        int chunk_compress_factor;
        int latent_dim;
    } ttl;
};

/**
 * Unicode text processor
 */
class UnicodeProcessor {
public:
    explicit UnicodeProcessor(const std::string& unicode_indexer_json_path);

    // Process text list to text IDs and mask
    void call(
        const std::vector<std::string>& text_list,
        const std::vector<std::string>& lang_list,
        std::vector<std::vector<int64_t>>& text_ids,
        std::vector<std::vector<std::vector<float>>>& text_mask
    );

private:
    std::vector<int64_t> indexer_;

    std::string preprocessText(const std::string& text, const std::string& lang);
    std::vector<uint16_t> textToUnicodeValues(const std::string& text);
    std::vector<std::vector<std::vector<float>>> getTextMask(
        const std::vector<int64_t>& text_ids_lengths
    );
};

/**
 * Style class
 */
class Style {
public:
    Style(const std::vector<float>& ttl_data, const std::vector<int64_t>& ttl_shape,
          const std::vector<float>& dp_data, const std::vector<int64_t>& dp_shape);

    const std::vector<float>& getTtlData() const { return ttl_data_; }
    const std::vector<float>& getDpData() const { return dp_data_; }
    const std::vector<int64_t>& getTtlShape() const { return ttl_shape_; }
    const std::vector<int64_t>& getDpShape() const { return dp_shape_; }

private:
    std::vector<float> ttl_data_;
    std::vector<float> dp_data_;
    std::vector<int64_t> ttl_shape_;
    std::vector<int64_t> dp_shape_;
};

/**
 * TextToSpeech class
 */
class TextToSpeech {
public:
    TextToSpeech(
        const Config& cfgs,
        UnicodeProcessor* text_processor,
        Ort::Session* dp_ort,
        Ort::Session* text_enc_ort,
        Ort::Session* vector_est_ort,
        Ort::Session* vocoder_ort
    );

    struct SynthesisResult {
        std::vector<float> wav;
        std::vector<float> duration;
    };

    SynthesisResult call(
        Ort::MemoryInfo& memory_info,
        const std::string& text,
        const std::string& lang,
        const Style& style,
        int total_step,
        float speed = 1.05f,
        float silence_duration = 0.3f
    );

    SynthesisResult batch(
        Ort::MemoryInfo& memory_info,
        const std::vector<std::string>& text_list,
        const std::vector<std::string>& lang_list,
        const Style& style,
        int total_step,
        float speed = 1.05f
    );

    int getSampleRate() const { return sample_rate_; }

private:
    SynthesisResult _infer(
        Ort::MemoryInfo& memory_info,
        const std::vector<std::string>& text_list,
        const std::vector<std::string>& lang_list,
        const Style& style,
        int total_step,
        float speed = 1.05f
    );

    Config cfgs_;
    UnicodeProcessor* text_processor_;
    Ort::Session* dp_ort_;
    Ort::Session* text_enc_ort_;
    Ort::Session* vector_est_ort_;
    Ort::Session* vocoder_ort_;
    int sample_rate_;
    int base_chunk_size_;
    int chunk_compress_factor_;
    int ldim_;

    void sampleNoisyLatent(
        const std::vector<float>& duration,
        std::vector<std::vector<std::vector<float>>>& noisy_latent,
        std::vector<std::vector<std::vector<float>>>& latent_mask
    );
};

// Utility functions
std::vector<std::vector<std::vector<float>>> lengthToMask(
    const std::vector<int64_t>& lengths, int max_len = -1
);

std::vector<std::vector<std::vector<float>>> getLatentMask(
    const std::vector<int64_t>& wav_lengths,
    int base_chunk_size,
    int chunk_compress_factor
);

// ONNX model loading
struct OnnxModels {
    std::unique_ptr<Ort::Session> dp;
    std::unique_ptr<Ort::Session> text_enc;
    std::unique_ptr<Ort::Session> vector_est;
    std::unique_ptr<Ort::Session> vocoder;
};

std::unique_ptr<Ort::Session> loadOnnx(
    Ort::Env& env,
    const std::string& onnx_path,
    const Ort::SessionOptions& opts
);

OnnxModels loadOnnxAll(
    Ort::Env& env,
    const std::string& onnx_dir,
    const Ort::SessionOptions& opts
);

// Configuration and processor loading
Config loadCfgs(const std::string& onnx_dir);

std::unique_ptr<UnicodeProcessor> loadTextProcessor(const std::string& onnx_dir);

// Voice style loading
Style loadVoiceStyle(const std::vector<std::string>& voice_style_paths, bool verbose = false);

// TextToSpeech loading
std::unique_ptr<TextToSpeech> loadTextToSpeech(
    Ort::Env& env,
    const std::string& onnx_dir,
    bool use_gpu = false
);

// WAV file writing
void writeWavFile(
    const std::string& filename,
    const std::vector<float>& audio_data,
    int sample_rate
);

// Tensor conversion utilities
void clearTensorBuffers();

Ort::Value arrayToTensor(
    Ort::MemoryInfo& memory_info,
    const std::vector<std::vector<std::vector<float>>>& array,
    const std::vector<int64_t>& dims
);

Ort::Value intArrayToTensor(
    Ort::MemoryInfo& memory_info,
    const std::vector<std::vector<int64_t>>& array,
    const std::vector<int64_t>& dims
);

// JSON loading helpers
std::vector<int64_t> loadJsonInt64(const std::string& file_path);

// Timer utility
template<typename Func>
auto timer(const std::string& name, Func&& func) -> decltype(func()) {
    auto start = std::chrono::high_resolution_clock::now();
    std::cout << name << "..." << std::endl;
    auto result = func();
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << " -> " << name << " completed in "
              << std::fixed << std::setprecision(2) << elapsed.count() << " sec" << std::endl;
    return result;
}

// Sanitize filename
std::string sanitizeFilename(const std::string& text, int max_len);

// Chunk text into manageable segments
std::vector<std::string> chunkText(const std::string& text, int max_len = 300);
41
csharp/.gitignore
vendored
Normal file
@@ -0,0 +1,41 @@
# Build results
bin/
obj/
[Dd]ebug/
[Rr]elease/
x64/
x86/
[Aa]rm/
[Aa]rm64/
bld/
[Bb]in/
[Oo]bj/
[Ll]og/

# Visual Studio files
.vs/
*.suo
*.user
*.userosscache
*.sln.docstates
*.userprefs

# Rider
.idea/
*.sln.iml

# User-specific files
*.rsuser
*.suo
*.user
*.userosscache
*.sln.docstates

# Output directory
results/*.wav

# OS files
.DS_Store
Thumbs.db
171
csharp/ExampleONNX.cs
Normal file
@@ -0,0 +1,171 @@
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Media;

namespace Supertonic
{
    class Program
    {
        class Args
        {
            public bool UseGpu { get; set; } = false;
            public string OnnxDir { get; set; } = "./assets/onnx";
            public int TotalStep { get; set; } = 5;
            public float Speed { get; set; } = 1.05f;
            public int NTest { get; set; } = 4;
            public List<string> VoiceStyle { get; set; } = new List<string> { "assets/voice_styles/F2.json" };
            public List<string> Text { get; set; } = new List<string>
            {
                "동해물과 백두산이 마르고 닳도록 하느님이 보우하사. 우리 나라 만세~~"
            };
            public List<string> Lang { get; set; } = new List<string> { "ko" };
            public string SaveDir { get; set; } = "results";
            public bool Batch { get; set; } = false;
            public int? Seed { get; set; } = null;
            public float PreSilence { get; set; } = 0.2f;
        }

        static Args ParseArgs(string[] args)
        {
            var result = new Args();

            for (int i = 0; i < args.Length; i++)
            {
                switch (args[i])
                {
                    case "--use-gpu":
                        result.UseGpu = true;
                        break;
                    case "--batch":
                        result.Batch = true;
                        break;
                    case "--onnx-dir" when i + 1 < args.Length:
                        result.OnnxDir = args[++i];
                        break;
                    case "--total-step" when i + 1 < args.Length:
                        result.TotalStep = int.Parse(args[++i]);
                        break;
                    case "--speed" when i + 1 < args.Length:
                        result.Speed = float.Parse(args[++i]);
                        break;
                    case "--n-test" when i + 1 < args.Length:
                        result.NTest = int.Parse(args[++i]);
                        break;
                    case "--voice-style" when i + 1 < args.Length:
                        result.VoiceStyle = args[++i].Split(',').ToList();
                        break;
                    case "--text" when i + 1 < args.Length:
                        result.Text = args[++i].Split('|').ToList();
                        break;
                    case "--lang" when i + 1 < args.Length:
                        result.Lang = args[++i].Split(',').ToList();
                        break;
                    case "--save-dir" when i + 1 < args.Length:
                        result.SaveDir = args[++i];
                        break;
                    case "--seed" when i + 1 < args.Length:
                        result.Seed = int.Parse(args[++i]);
                        break;
                    case "--pre-silence" when i + 1 < args.Length:
                        result.PreSilence = float.Parse(args[++i]);
                        break;
                }
            }

            return result;
        }

        static void Main(string[] args)
        {
            Console.WriteLine("=== TTS Inference with ONNX Runtime (C#) ===\n");
            Console.WriteLine("sample seed : 371279630");

            // --- 1. Parse arguments --- //
            var parsedArgs = ParseArgs(args);
            int totalStep = parsedArgs.TotalStep;
            float speed = parsedArgs.Speed;
            int nTest = parsedArgs.NTest;
            string saveDir = parsedArgs.SaveDir;
            var voiceStylePaths = parsedArgs.VoiceStyle;
            var textList = parsedArgs.Text;
            var langList = parsedArgs.Lang;
            bool batch = parsedArgs.Batch;

            if (voiceStylePaths.Count != textList.Count)
            {
                throw new ArgumentException(
                    $"Number of voice styles ({voiceStylePaths.Count}) must match number of texts ({textList.Count})");
            }
            int bsz = voiceStylePaths.Count;

            // --- 2. Load Text to Speech --- //
            var textToSpeech = Helper.LoadTextToSpeech(parsedArgs.OnnxDir, parsedArgs.UseGpu);
            Console.WriteLine();

            // --- 3. Load Voice Style --- //
            var style = Helper.LoadVoiceStyle(voiceStylePaths, verbose: true);

            // --- 4. Synthesize speech --- //
            Random seedGenerator = new Random();
            for (int n = 0; n < nTest; n++)
            {
                int currentSeed = parsedArgs.Seed ?? seedGenerator.Next();
                Console.WriteLine($"\n[{n + 1}/{nTest}] Starting synthesis (Seed: {currentSeed})...");

                var (wav, duration) = Helper.Timer("Generating speech from text", () =>
                {
                    if (batch)
                    {
                        return textToSpeech.Batch(textList, langList, style, totalStep, speed, currentSeed);
                    }
                    else
                    {
                        return textToSpeech.Call(textList[0], langList[0], style, totalStep, speed, seed: currentSeed);
                    }
                });

                if (!Directory.Exists(saveDir))
                {
                    Directory.CreateDirectory(saveDir);
                }

                for (int b = 0; b < bsz; b++)
                {
                    string fname = $"{Helper.SanitizeFilename(textList[b], 20)}_{n + 1}_s{currentSeed}.wav";

                    int wavLen = (int)(textToSpeech.SampleRate * duration[b]);

                    // --- Add Pre-Silence (Delay) --- //
                    int silenceSamples = (int)(textToSpeech.SampleRate * parsedArgs.PreSilence);
                    var wavOut = new float[wavLen + silenceSamples];

                    // The array is initialized to 0 by default, so we just copy the audio after the silence
                    Array.Copy(wav, b * wav.Length / bsz, wavOut, silenceSamples, Math.Min(wavLen, wav.Length / bsz));

                    string outputPath = Path.Combine(saveDir, fname);
                    Helper.WriteWavFile(outputPath, wavOut, textToSpeech.SampleRate);
                    Console.WriteLine($"Saved: {outputPath}");

                    // --- Play the generated audio --- //
                    try
                    {
                        using (var player = new SoundPlayer(outputPath))
                        {
                            Console.WriteLine("Playing audio...");
                            player.PlaySync();
                        }
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine($"Warning: Could not play audio. {ex.Message}");
                    }
                }
            }

            Console.WriteLine("\n=== Synthesis completed successfully! ===");
        }
    }
}
861
csharp/Helper.cs
Normal file
@@ -0,0 +1,861 @@
|
||||
using System;
|
||||
using System.Collections.Generic;
|
||||
using System.IO;
|
||||
using System.Linq;
|
||||
using System.Text;
|
||||
using System.Text.Json;
|
||||
using System.Text.RegularExpressions;
|
||||
using Microsoft.ML.OnnxRuntime;
|
||||
using Microsoft.ML.OnnxRuntime.Tensors;
|
||||
|
||||
namespace Supertonic
|
||||
{
|
||||
// Available languages for multilingual TTS
|
||||
public static class Languages
|
||||
{
|
||||
public static readonly string[] Available = { "en", "ko", "es", "pt", "fr" };
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Configuration classes
|
||||
// ============================================================================
|
||||
|
||||
public class Config
|
||||
{
|
||||
public AEConfig AE { get; set; } = null!;
|
||||
public TTLConfig TTL { get; set; } = null!;
|
||||
|
||||
public class AEConfig
|
||||
{
|
||||
public int SampleRate { get; set; }
|
||||
public int BaseChunkSize { get; set; }
|
||||
}
|
||||
|
||||
public class TTLConfig
|
||||
{
|
||||
public int ChunkCompressFactor { get; set; }
|
||||
public int LatentDim { get; set; }
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Style class
|
||||
// ============================================================================
|
||||
|
||||
public class Style
|
||||
{
|
||||
public float[] Ttl { get; set; }
|
||||
public long[] TtlShape { get; set; }
|
||||
public float[] Dp { get; set; }
|
||||
public long[] DpShape { get; set; }
|
||||
|
||||
public Style(float[] ttl, long[] ttlShape, float[] dp, long[] dpShape)
|
||||
{
|
||||
Ttl = ttl;
|
||||
TtlShape = ttlShape;
|
||||
Dp = dp;
|
||||
DpShape = dpShape;
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Unicode text processor
|
||||
// ============================================================================
|
||||
|
||||
public class UnicodeProcessor
|
||||
{
|
||||
private readonly Dictionary<int, long> _indexer;
|
||||
|
||||
public UnicodeProcessor(string unicodeIndexerPath)
|
||||
{
|
||||
var json = File.ReadAllText(unicodeIndexerPath);
|
||||
var indexerArray = JsonSerializer.Deserialize<long[]>(json) ?? throw new Exception("Failed to load indexer");
|
||||
_indexer = new Dictionary<int, long>();
|
||||
for (int i = 0; i < indexerArray.Length; i++)
|
||||
{
|
||||
_indexer[i] = indexerArray[i];
|
||||
}
|
||||
}
|
||||
|
||||
private static string RemoveEmojis(string text)
|
||||
{
|
||||
var result = new StringBuilder();
|
||||
for (int i = 0; i < text.Length; i++)
|
||||
{
|
||||
int codePoint;
|
||||
if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
|
||||
{
|
||||
// Get the full code point from surrogate pair
|
||||
codePoint = char.ConvertToUtf32(text[i], text[i + 1]);
|
||||
i++; // Skip the low surrogate
|
||||
}
|
||||
else
|
||||
{
|
||||
codePoint = text[i];
|
||||
}
|
||||
|
||||
// Check if code point is in emoji ranges
|
||||
bool isEmoji = (codePoint >= 0x1F600 && codePoint <= 0x1F64F) ||
|
||||
(codePoint >= 0x1F300 && codePoint <= 0x1F5FF) ||
|
||||
(codePoint >= 0x1F680 && codePoint <= 0x1F6FF) ||
|
||||
(codePoint >= 0x1F700 && codePoint <= 0x1F77F) ||
|
||||
(codePoint >= 0x1F780 && codePoint <= 0x1F7FF) ||
|
||||
(codePoint >= 0x1F800 && codePoint <= 0x1F8FF) ||
|
||||
(codePoint >= 0x1F900 && codePoint <= 0x1F9FF) ||
|
||||
(codePoint >= 0x1FA00 && codePoint <= 0x1FA6F) ||
|
||||
(codePoint >= 0x1FA70 && codePoint <= 0x1FAFF) ||
|
||||
(codePoint >= 0x2600 && codePoint <= 0x26FF) ||
|
||||
(codePoint >= 0x2700 && codePoint <= 0x27BF) ||
|
||||
(codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF);
|
||||
|
||||
if (!isEmoji)
|
||||
{
|
||||
if (codePoint > 0xFFFF)
|
||||
{
|
||||
// Add back as surrogate pair
|
||||
result.Append(char.ConvertFromUtf32(codePoint));
|
||||
}
|
||||
else
|
||||
{
|
||||
result.Append((char)codePoint);
|
||||
}
|
||||
}
|
||||
}
|
||||
return result.ToString();
|
||||
}
|
||||
|
||||
private string PreprocessText(string text, string lang)
|
||||
{
|
||||
// TODO: Need advanced normalizer for better performance
|
||||
text = text.Normalize(NormalizationForm.FormKD);
|
||||
|
||||
// Remove emojis (wide Unicode range)
|
||||
// C# doesn't support \u{...} syntax in regex, so we use character filtering instead
|
||||
text = RemoveEmojis(text);
|
||||
|
||||
// Replace various dashes and symbols
|
||||
var replacements = new Dictionary<string, string>
|
||||
{
|
||||
{"–", "-"}, // en dash
|
||||
{"‑", "-"}, // non-breaking hyphen
|
||||
{"—", "-"}, // em dash
|
||||
{"_", " "}, // underscore
|
||||
{"\u201C", "\""}, // left double quote
|
||||
{"\u201D", "\""}, // right double quote
|
||||
{"\u2018", "'"}, // left single quote
|
||||
{"\u2019", "'"}, // right single quote
|
||||
{"´", "'"}, // acute accent
|
||||
{"`", "'"}, // grave accent
|
||||
{"[", " "}, // left bracket
|
||||
{"]", " "}, // right bracket
|
||||
{"|", " "}, // vertical bar
|
||||
{"/", " "}, // slash
|
||||
{"#", " "}, // hash
|
||||
{"→", " "}, // right arrow
|
||||
{"←", " "}, // left arrow
|
||||
};
|
||||
|
||||
foreach (var kvp in replacements)
|
||||
{
|
||||
text = text.Replace(kvp.Key, kvp.Value);
|
||||
}
|
||||
|
||||
// Remove special symbols
|
||||
text = Regex.Replace(text, @"[♥☆♡©\\]", "");
|
||||
|
||||
// Replace known expressions
|
||||
var exprReplacements = new Dictionary<string, string>
|
||||
{
|
||||
{"@", " at "},
|
||||
{"e.g.,", "for example, "},
|
||||
{"i.e.,", "that is, "},
|
||||
};
|
||||
|
||||
foreach (var kvp in exprReplacements)
|
||||
{
|
||||
text = text.Replace(kvp.Key, kvp.Value);
|
||||
}
|
||||
|
||||
// Fix spacing around punctuation
|
||||
text = Regex.Replace(text, @" ,", ",");
|
||||
text = Regex.Replace(text, @" \.", ".");
|
||||
text = Regex.Replace(text, @" !", "!");
|
||||
text = Regex.Replace(text, @" \?", "?");
|
||||
text = Regex.Replace(text, @" ;", ";");
|
||||
text = Regex.Replace(text, @" :", ":");
|
||||
text = Regex.Replace(text, @" '", "'");
|
||||
|
||||
// Remove duplicate quotes
|
||||
while (text.Contains("\"\""))
|
||||
{
|
||||
text = text.Replace("\"\"", "\"");
|
||||
}
|
||||
while (text.Contains("''"))
|
||||
{
|
||||
text = text.Replace("''", "'");
|
||||
}
|
||||
while (text.Contains("``"))
|
||||
{
|
||||
text = text.Replace("``", "`");
|
||||
}
|
||||
|
||||
// Remove extra spaces
|
||||
text = Regex.Replace(text, @"\s+", " ").Trim();
|
||||
|
||||
// If text doesn't end with punctuation, quotes, or closing brackets, add a period
|
||||
if (!Regex.IsMatch(text, @"[.!?;:,'\u0022\u201C\u201D\u2018\u2019)\]}…。」』】〉》›»]$"))
|
||||
{
|
||||
text += ".";
|
||||
}
|
||||
|
||||
// Validate language
|
||||
if (!Languages.Available.Contains(lang))
|
||||
{
|
||||
throw new ArgumentException($"Invalid language: {lang}. Available: {string.Join(", ", Languages.Available)}");
|
||||
}
|
||||
|
||||
// Wrap text with language tags
|
||||
text = $"<{lang}>" + text + $"</{lang}>";
|
||||
|
||||
return text;
|
||||
}
|
||||
|
||||
private int[] TextToUnicodeValues(string text)
|
||||
{
|
||||
return text.Select(c => (int)c).ToArray();
|
||||
}
|
||||
|
||||
private float[][][] GetTextMask(long[] textIdsLengths)
|
||||
{
|
||||
return Helper.LengthToMask(textIdsLengths);
|
||||
}
|
||||
|
||||
public (long[][] textIds, float[][][] textMask) Call(List<string> textList, List<string> langList)
|
||||
{
|
||||
var processedTexts = textList.Select((t, i) => PreprocessText(t, langList[i])).ToList();
|
||||
var textIdsLengths = processedTexts.Select(t => (long)t.Length).ToArray();
|
||||
long maxLen = textIdsLengths.Max();
|
||||
|
||||
var textIds = new long[textList.Count][];
|
||||
for (int i = 0; i < processedTexts.Count; i++)
|
||||
{
|
||||
textIds[i] = new long[maxLen];
|
||||
var unicodeVals = TextToUnicodeValues(processedTexts[i]);
|
||||
for (int j = 0; j < unicodeVals.Length; j++)
|
||||
{
|
||||
if (_indexer.TryGetValue(unicodeVals[j], out long val))
|
||||
{
|
||||
textIds[i][j] = val;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
var textMask = GetTextMask(textIdsLengths);
|
||||
return (textIds, textMask);
|
||||
}
|
||||
}
|
||||
|
||||

    // ============================================================================
    // TextToSpeech class
    // ============================================================================

    public class TextToSpeech
    {
        private readonly Config _cfgs;
        private readonly UnicodeProcessor _textProcessor;
        private readonly InferenceSession _dpOrt;
        private readonly InferenceSession _textEncOrt;
        private readonly InferenceSession _vectorEstOrt;
        private readonly InferenceSession _vocoderOrt;
        public readonly int SampleRate;
        private readonly int _baseChunkSize;
        private readonly int _chunkCompressFactor;
        private readonly int _ldim;

        public TextToSpeech(
            Config cfgs,
            UnicodeProcessor textProcessor,
            InferenceSession dpOrt,
            InferenceSession textEncOrt,
            InferenceSession vectorEstOrt,
            InferenceSession vocoderOrt)
        {
            _cfgs = cfgs;
            _textProcessor = textProcessor;
            _dpOrt = dpOrt;
            _textEncOrt = textEncOrt;
            _vectorEstOrt = vectorEstOrt;
            _vocoderOrt = vocoderOrt;
            SampleRate = cfgs.AE.SampleRate;
            _baseChunkSize = cfgs.AE.BaseChunkSize;
            _chunkCompressFactor = cfgs.TTL.ChunkCompressFactor;
            _ldim = cfgs.TTL.LatentDim;
        }

        private (float[][][] noisyLatent, float[][][] latentMask) SampleNoisyLatent(float[] duration, int seed)
        {
            int bsz = duration.Length;
            float wavLenMax = duration.Max() * SampleRate;
            var wavLengths = duration.Select(d => (long)(d * SampleRate)).ToArray();
            int chunkSize = _baseChunkSize * _chunkCompressFactor;
            int latentLen = (int)((wavLenMax + chunkSize - 1) / chunkSize);
            int latentDim = _ldim * _chunkCompressFactor;

            // Generate random noise with fixed seed
            var random = new Random(seed);
            var noisyLatent = new float[bsz][][];
            for (int b = 0; b < bsz; b++)
            {
                noisyLatent[b] = new float[latentDim][];
                for (int d = 0; d < latentDim; d++)
                {
                    noisyLatent[b][d] = new float[latentLen];
                    for (int t = 0; t < latentLen; t++)
                    {
                        // Box-Muller transform for normal distribution
                        double u1 = 1.0 - random.NextDouble();
                        double u2 = 1.0 - random.NextDouble();
                        noisyLatent[b][d][t] = (float)(Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2));
                    }
                }
            }

            var latentMask = Helper.GetLatentMask(wavLengths, _baseChunkSize, _chunkCompressFactor);

            // Apply mask
            for (int b = 0; b < bsz; b++)
            {
                for (int d = 0; d < latentDim; d++)
                {
                    for (int t = 0; t < latentLen; t++)
                    {
                        noisyLatent[b][d][t] *= latentMask[b][0][t];
                    }
                }
            }

            return (noisyLatent, latentMask);
        }

        private (float[] wav, float[] duration) _Infer(List<string> textList, List<string> langList, Style style, int totalStep, float speed = 1.05f, int seed = 42)
        {
            int bsz = textList.Count;
            if (bsz != style.TtlShape[0])
            {
                throw new ArgumentException("Number of texts must match number of style vectors");
            }

            // Process text
            var (textIds, textMask) = _textProcessor.Call(textList, langList);
            var textIdsShape = new long[] { bsz, textIds[0].Length };
            var textMaskShape = new long[] { bsz, 1, textMask[0][0].Length };

            var textIdsTensor = Helper.IntArrayToTensor(textIds, textIdsShape);
            var textMaskTensor = Helper.ArrayToTensor(textMask, textMaskShape);

            var styleTtlTensor = new DenseTensor<float>(style.Ttl, style.TtlShape.Select(x => (int)x).ToArray());
            var styleDpTensor = new DenseTensor<float>(style.Dp, style.DpShape.Select(x => (int)x).ToArray());

            // Run duration predictor
            var dpInputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("text_ids", textIdsTensor),
                NamedOnnxValue.CreateFromTensor("style_dp", styleDpTensor),
                NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor)
            };
            using var dpOutputs = _dpOrt.Run(dpInputs);
            var durOnnx = dpOutputs.First(o => o.Name == "duration").AsTensor<float>().ToArray();

            // Apply speed factor to duration
            for (int i = 0; i < durOnnx.Length; i++)
            {
                durOnnx[i] /= speed;
            }

            // Run text encoder
            var textEncInputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("text_ids", textIdsTensor),
                NamedOnnxValue.CreateFromTensor("style_ttl", styleTtlTensor),
                NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor)
            };
            using var textEncOutputs = _textEncOrt.Run(textEncInputs);
            var textEmbTensor = textEncOutputs.First(o => o.Name == "text_emb").AsTensor<float>();

            // Sample noisy latent
            var (xt, latentMask) = SampleNoisyLatent(durOnnx, seed);
            var latentShape = new long[] { bsz, xt[0].Length, xt[0][0].Length };
            var latentMaskShape = new long[] { bsz, 1, latentMask[0][0].Length };

            var totalStepArray = Enumerable.Repeat((float)totalStep, bsz).ToArray();

            // Iterative denoising
            for (int step = 0; step < totalStep; step++)
            {
                var currentStepArray = Enumerable.Repeat((float)step, bsz).ToArray();

                var vectorEstInputs = new List<NamedOnnxValue>
                {
                    NamedOnnxValue.CreateFromTensor("noisy_latent", Helper.ArrayToTensor(xt, latentShape)),
                    NamedOnnxValue.CreateFromTensor("text_emb", textEmbTensor),
                    NamedOnnxValue.CreateFromTensor("style_ttl", styleTtlTensor),
                    NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor),
                    NamedOnnxValue.CreateFromTensor("latent_mask", Helper.ArrayToTensor(latentMask, latentMaskShape)),
                    NamedOnnxValue.CreateFromTensor("total_step", new DenseTensor<float>(totalStepArray, new int[] { bsz })),
                    NamedOnnxValue.CreateFromTensor("current_step", new DenseTensor<float>(currentStepArray, new int[] { bsz }))
                };

                using var vectorEstOutputs = _vectorEstOrt.Run(vectorEstInputs);
                var denoisedLatent = vectorEstOutputs.First(o => o.Name == "denoised_latent").AsTensor<float>();

                // Update xt
                int idx = 0;
                for (int b = 0; b < bsz; b++)
                {
                    for (int d = 0; d < xt[b].Length; d++)
                    {
                        for (int t = 0; t < xt[b][d].Length; t++)
                        {
                            xt[b][d][t] = denoisedLatent.GetValue(idx++);
                        }
                    }
                }
            }

            // Run vocoder
            var vocoderInputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("latent", Helper.ArrayToTensor(xt, latentShape))
            };
            using var vocoderOutputs = _vocoderOrt.Run(vocoderInputs);
            var wavTensor = vocoderOutputs.First(o => o.Name == "wav_tts").AsTensor<float>();

            return (wavTensor.ToArray(), durOnnx);
        }

        public (float[] wav, float[] duration) Call(string text, string lang, Style style, int totalStep, float speed = 1.05f, float silenceDuration = 0.3f, int seed = 42)
        {
            if (style.TtlShape[0] != 1)
            {
                throw new ArgumentException("Single speaker text to speech only supports single style");
            }

            int maxLen = lang == "ko" ? 120 : 300;
            var textList = Helper.ChunkText(text, maxLen);
            var wavCat = new List<float>();
            float durCat = 0.0f;

            foreach (var chunk in textList)
            {
                var (wav, duration) = _Infer(new List<string> { chunk }, new List<string> { lang }, style, totalStep, speed, seed);

                if (wavCat.Count == 0)
                {
                    wavCat.AddRange(wav);
                    durCat = duration[0];
                }
                else
                {
                    int silenceLen = (int)(silenceDuration * SampleRate);
                    var silence = new float[silenceLen];
                    wavCat.AddRange(silence);
                    wavCat.AddRange(wav);
                    durCat += duration[0] + silenceDuration;
                }
            }

            return (wavCat.ToArray(), new float[] { durCat });
        }

        public (float[] wav, float[] duration) Batch(List<string> textList, List<string> langList, Style style, int totalStep, float speed = 1.05f, int seed = 42)
        {
            return _Infer(textList, langList, style, totalStep, speed, seed);
        }
    }

    // ============================================================================
    // Helper class with utility functions
    // ============================================================================

    public static class Helper
    {
        // ============================================================================
        // Utility functions
        // ============================================================================

        public static float[][][] LengthToMask(long[] lengths, long maxLen = -1)
        {
            if (maxLen == -1)
            {
                maxLen = lengths.Max();
            }

            var mask = new float[lengths.Length][][];
            for (int i = 0; i < lengths.Length; i++)
            {
                mask[i] = new float[1][];
                mask[i][0] = new float[maxLen];
                for (int j = 0; j < maxLen; j++)
                {
                    mask[i][0][j] = j < lengths[i] ? 1.0f : 0.0f;
                }
            }
            return mask;
        }

        public static float[][][] GetLatentMask(long[] wavLengths, int baseChunkSize, int chunkCompressFactor)
        {
            int latentSize = baseChunkSize * chunkCompressFactor;
            var latentLengths = wavLengths.Select(len => (len + latentSize - 1) / latentSize).ToArray();
            return LengthToMask(latentLengths);
        }
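
        // Illustrative example (values not from the original source):
        // LengthToMask(new long[] { 3, 5 }) yields a [2][1][5] mask:
        //   mask[0][0] = { 1, 1, 1, 0, 0 }
        //   mask[1][0] = { 1, 1, 1, 1, 1 }
        // i.e. ones up to each sequence length, zeros as padding.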

        // ============================================================================
        // ONNX model loading
        // ============================================================================

        public static InferenceSession LoadOnnx(string onnxPath, SessionOptions opts)
        {
            return new InferenceSession(onnxPath, opts);
        }

        public static (InferenceSession dp, InferenceSession textEnc, InferenceSession vectorEst, InferenceSession vocoder)
            LoadOnnxAll(string onnxDir, SessionOptions opts)
        {
            var dpPath = Path.Combine(onnxDir, "duration_predictor.onnx");
            var textEncPath = Path.Combine(onnxDir, "text_encoder.onnx");
            var vectorEstPath = Path.Combine(onnxDir, "vector_estimator.onnx");
            var vocoderPath = Path.Combine(onnxDir, "vocoder.onnx");

            return (
                LoadOnnx(dpPath, opts),
                LoadOnnx(textEncPath, opts),
                LoadOnnx(vectorEstPath, opts),
                LoadOnnx(vocoderPath, opts)
            );
        }

        // ============================================================================
        // Configuration loading
        // ============================================================================

        public static Config LoadCfgs(string onnxDir)
        {
            var cfgPath = Path.Combine(onnxDir, "tts.json");
            var json = File.ReadAllText(cfgPath);

            using var doc = JsonDocument.Parse(json);
            var root = doc.RootElement;

            return new Config
            {
                AE = new Config.AEConfig
                {
                    SampleRate = root.GetProperty("ae").GetProperty("sample_rate").GetInt32(),
                    BaseChunkSize = root.GetProperty("ae").GetProperty("base_chunk_size").GetInt32()
                },
                TTL = new Config.TTLConfig
                {
                    ChunkCompressFactor = root.GetProperty("ttl").GetProperty("chunk_compress_factor").GetInt32(),
                    LatentDim = root.GetProperty("ttl").GetProperty("latent_dim").GetInt32()
                }
            };
        }

        public static UnicodeProcessor LoadTextProcessor(string onnxDir)
        {
            var unicodeIndexerPath = Path.Combine(onnxDir, "unicode_indexer.json");
            return new UnicodeProcessor(unicodeIndexerPath);
        }

        // ============================================================================
        // Voice style loading
        // ============================================================================

        public static Style LoadVoiceStyle(List<string> voiceStylePaths, bool verbose = false)
        {
            int bsz = voiceStylePaths.Count;

            // Read first file to get dimensions
            var firstJson = File.ReadAllText(voiceStylePaths[0]);
            using var firstDoc = JsonDocument.Parse(firstJson);
            var firstRoot = firstDoc.RootElement;

            var ttlDims = ParseInt64Array(firstRoot.GetProperty("style_ttl").GetProperty("dims"));
            var dpDims = ParseInt64Array(firstRoot.GetProperty("style_dp").GetProperty("dims"));

            long ttlDim1 = ttlDims[1];
            long ttlDim2 = ttlDims[2];
            long dpDim1 = dpDims[1];
            long dpDim2 = dpDims[2];

            // Pre-allocate arrays with full batch size
            int ttlSize = (int)(bsz * ttlDim1 * ttlDim2);
            int dpSize = (int)(bsz * dpDim1 * dpDim2);
            var ttlFlat = new float[ttlSize];
            var dpFlat = new float[dpSize];

            // Fill in the data
            for (int i = 0; i < bsz; i++)
            {
                var json = File.ReadAllText(voiceStylePaths[i]);
                using var doc = JsonDocument.Parse(json);
                var root = doc.RootElement;

                // Flatten data
                var ttlData3D = ParseFloat3DArray(root.GetProperty("style_ttl").GetProperty("data"));
                var ttlDataFlat = new List<float>();
                foreach (var batch in ttlData3D)
                {
                    foreach (var row in batch)
                    {
                        ttlDataFlat.AddRange(row);
                    }
                }

                var dpData3D = ParseFloat3DArray(root.GetProperty("style_dp").GetProperty("data"));
                var dpDataFlat = new List<float>();
                foreach (var batch in dpData3D)
                {
                    foreach (var row in batch)
                    {
                        dpDataFlat.AddRange(row);
                    }
                }

                // Copy to pre-allocated array
                int ttlOffset = (int)(i * ttlDim1 * ttlDim2);
                ttlDataFlat.CopyTo(ttlFlat, ttlOffset);

                int dpOffset = (int)(i * dpDim1 * dpDim2);
                dpDataFlat.CopyTo(dpFlat, dpOffset);
            }

            var ttlShape = new long[] { bsz, ttlDim1, ttlDim2 };
            var dpShape = new long[] { bsz, dpDim1, dpDim2 };

            if (verbose)
            {
                Console.WriteLine($"Loaded {bsz} voice styles");
            }

            return new Style(ttlFlat, ttlShape, dpFlat, dpShape);
        }

        private static float[][][] ParseFloat3DArray(JsonElement element)
        {
            var result = new List<float[][]>();
            foreach (var batch in element.EnumerateArray())
            {
                var batch2D = new List<float[]>();
                foreach (var row in batch.EnumerateArray())
                {
                    var rowData = new List<float>();
                    foreach (var val in row.EnumerateArray())
                    {
                        rowData.Add(val.GetSingle());
                    }
                    batch2D.Add(rowData.ToArray());
                }
                result.Add(batch2D.ToArray());
            }
            return result.ToArray();
        }

        private static long[] ParseInt64Array(JsonElement element)
        {
            var result = new List<long>();
            foreach (var val in element.EnumerateArray())
            {
                result.Add(val.GetInt64());
            }
            return result.ToArray();
        }

        // ============================================================================
        // TextToSpeech loading
        // ============================================================================

        public static TextToSpeech LoadTextToSpeech(string onnxDir, bool useGpu = false)
        {
            var opts = new SessionOptions();
            if (useGpu)
            {
                throw new NotImplementedException("GPU mode is not supported yet");
            }
            else
            {
                Console.WriteLine("Using CPU for inference");
            }

            var cfgs = LoadCfgs(onnxDir);
            var (dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) = LoadOnnxAll(onnxDir, opts);
            var textProcessor = LoadTextProcessor(onnxDir);

            return new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
        }

        // ============================================================================
        // WAV file writing
        // ============================================================================

        public static void WriteWavFile(string filename, float[] audioData, int sampleRate)
        {
            using var writer = new BinaryWriter(File.Open(filename, FileMode.Create));

            int numChannels = 1;
            int bitsPerSample = 16;
            int byteRate = sampleRate * numChannels * bitsPerSample / 8;
            short blockAlign = (short)(numChannels * bitsPerSample / 8);
            int dataSize = audioData.Length * bitsPerSample / 8;

            // RIFF header
            writer.Write(Encoding.ASCII.GetBytes("RIFF"));
            writer.Write(36 + dataSize);
            writer.Write(Encoding.ASCII.GetBytes("WAVE"));

            // fmt chunk
            writer.Write(Encoding.ASCII.GetBytes("fmt "));
            writer.Write(16); // fmt chunk size
            writer.Write((short)1); // audio format (PCM)
            writer.Write((short)numChannels);
            writer.Write(sampleRate);
            writer.Write(byteRate);
            writer.Write(blockAlign);
            writer.Write((short)bitsPerSample);

            // data chunk
            writer.Write(Encoding.ASCII.GetBytes("data"));
            writer.Write(dataSize);

            // Write audio data
            foreach (var sample in audioData)
            {
                float clamped = Math.Max(-1.0f, Math.Min(1.0f, sample));
                short intSample = (short)(clamped * 32767);
                writer.Write(intSample);
            }
        }
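
        // For reference, the 44 bytes written above follow the canonical PCM WAV layout:
        //   bytes  0-11  "RIFF" + (36 + dataSize) + "WAVE"
        //   bytes 12-35  "fmt " + chunkSize(16) + format(1, PCM) + channels + sampleRate
        //                + byteRate + blockAlign + bitsPerSample
        //   bytes 36-43  "data" + dataSize, followed by 16-bit little-endian samples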

        // ============================================================================
        // Tensor conversion utilities
        // ============================================================================

        public static DenseTensor<float> ArrayToTensor(float[][][] array, long[] dims)
        {
            var flat = new List<float>();
            foreach (var batch in array)
            {
                foreach (var row in batch)
                {
                    flat.AddRange(row);
                }
            }
            return new DenseTensor<float>(flat.ToArray(), dims.Select(x => (int)x).ToArray());
        }

        public static DenseTensor<long> IntArrayToTensor(long[][] array, long[] dims)
        {
            var flat = new List<long>();
            foreach (var row in array)
            {
                flat.AddRange(row);
            }
            return new DenseTensor<long>(flat.ToArray(), dims.Select(x => (int)x).ToArray());
        }

        // ============================================================================
        // Timer utility
        // ============================================================================

        public static T Timer<T>(string name, Func<T> func)
        {
            var start = DateTime.Now;
            Console.WriteLine($"{name}...");
            var result = func();
            var elapsed = (DateTime.Now - start).TotalSeconds;
            Console.WriteLine($" -> {name} completed in {elapsed:F2} sec");
            return result;
        }

        public static string SanitizeFilename(string text, int maxLen)
        {
            var result = new StringBuilder();
            int count = 0;
            foreach (char c in text)
            {
                if (count >= maxLen) break;
                if (char.IsLetterOrDigit(c))
                {
                    result.Append(c);
                }
                else
                {
                    result.Append('_');
                }
                count++;
            }
            return result.ToString();
        }

        // ============================================================================
        // Chunk text
        // ============================================================================

        public static List<string> ChunkText(string text, int maxLen = 300)
        {
            var chunks = new List<string>();

            // Split by paragraph (two or more newlines)
            var paragraphRegex = new Regex(@"\n\s*\n+");
            var paragraphs = paragraphRegex.Split(text.Trim())
                .Select(p => p.Trim())
                .Where(p => !string.IsNullOrEmpty(p))
                .ToList();

            // Split by sentence boundaries, excluding abbreviations
            var sentenceRegex = new Regex(@"(?<!Mr\.|Mrs\.|Ms\.|Dr\.|Prof\.|Sr\.|Jr\.|Ph\.D\.|etc\.|e\.g\.|i\.e\.|vs\.|Inc\.|Ltd\.|Co\.|Corp\.|St\.|Ave\.|Blvd\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+");

            foreach (var paragraph in paragraphs)
            {
                var sentences = sentenceRegex.Split(paragraph);
                string currentChunk = "";

                foreach (var sentence in sentences)
                {
                    if (string.IsNullOrEmpty(sentence)) continue;

                    if (currentChunk.Length + sentence.Length + 1 <= maxLen)
                    {
                        if (!string.IsNullOrEmpty(currentChunk))
                        {
                            currentChunk += " ";
                        }
                        currentChunk += sentence;
                    }
                    else
                    {
                        if (!string.IsNullOrEmpty(currentChunk))
                        {
                            chunks.Add(currentChunk.Trim());
                        }
                        currentChunk = sentence;
                    }
                }

                if (!string.IsNullOrEmpty(currentChunk))
                {
                    chunks.Add(currentChunk.Trim());
                }
            }

            // If no chunks were created, return the original text
            if (chunks.Count == 0)
            {
                chunks.Add(text.Trim());
            }

            return chunks;
        }
    }
}

csharp/Properties/launchSettings.json

{
  "profiles": {
    "Supertonic": {
      "commandName": "Project",
      "commandLineArgs": "--seed 371279630"
    }
  }
}

csharp/README.md

# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using `ExampleONNX.cs`.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)

**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.

**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).

**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.

## Installation

### Prerequisites
- .NET 9.0 SDK or later
- [Download .NET SDK](https://dotnet.microsoft.com/download)

### Install dependencies
```bash
dotnet restore
```

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
dotnet run
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4

### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
dotnet run -- \
  --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
  --lang en,ko \
  --batch
```

This will:
- Use the `--batch` flag to enable batch processing mode
- Generate speech for 2 different voice-text pairs
- Use the male voice style (M1.json) for the first, English text
- Use the female voice style (F1.json) for the second, Korean text
- Process both samples in a single batch (automatic text chunking disabled)

### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
dotnet run -- \
  --total-step 10 \
  --voice-style assets/voice_styles/M1.json \
  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference

### Example 4: Long-Form Inference
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
```bash
dotnet run -- \
  --voice-style assets/voice_styles/M1.json \
  --text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
```

This will:
- Automatically split the long text into smaller chunks (max 300 characters by default)
- Process each chunk separately while maintaining natural speech flow
- Insert brief silences (0.3 seconds) between chunks for natural pacing
- Combine all chunks into a single output audio file

**Note**: When using batch mode (`--batch`), automatic text chunking is disabled. Use non-batch mode for long-form text synthesis.

## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated) |
| `--text` | str+ | (long default text) | Text(s) to synthesize (separated by `\|`) |
| `--lang` | str+ | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr` (comma-separated) |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |

## Notes

- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Long-Form Inference**: Without the `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
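
The CLI examples above all go through the same C# API, so the pipeline can also be driven programmatically. A minimal sketch using the classes defined in `Helper.cs` (the input text and output path here are placeholders, and the ONNX models plus a voice style file must already be present):

```csharp
using System.Collections.Generic;

// Load config, the four ONNX sessions, and the unicode text processor
var tts = Helper.LoadTextToSpeech("assets/onnx");

// Load a single voice style (batch size 1)
var style = Helper.LoadVoiceStyle(new List<string> { "assets/voice_styles/M1.json" });

// Synthesize; long input is chunked automatically in non-batch mode
var (wav, duration) = tts.Call("Hello from Supertonic!", "en", style, totalStep: 5);

Helper.WriteWavFile("results/hello.wav", wav, tts.SampleRate);
```

For multiple voice-text pairs in one pass, `tts.Batch(...)` takes lists of texts and languages alongside a multi-style `Style`, mirroring the `--batch` flag.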
## Building the Project
|
||||
|
||||
### Build for Release
|
||||
```bash
|
||||
dotnet build -c Release
|
||||
```
|
||||
|
||||
### Run the compiled executable
|
||||
```bash
|
||||
./bin/Release/net9.0/Supertonic
|
||||
```
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
csharp/
|
||||
├── ExampleONNX.cs # Main inference script
|
||||
├── Helper.cs # Helper functions and classes
|
||||
├── Supertonic.csproj # Project configuration
|
||||
├── README.md # This file
|
||||
└── results/ # Output directory (created automatically)
|
||||
```
|
||||
|
||||
|
||||

csharp/Supertonic.csproj

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net9.0-windows</TargetFramework>
    <UseWindowsForms>true</UseWindowsForms>
    <LangVersion>13.0</LangVersion>
    <Nullable>enable</Nullable>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="Microsoft.ML.OnnxRuntime" Version="1.20.1" />
    <PackageReference Include="System.Text.Json" Version="9.0.1" />
  </ItemGroup>

</Project>

csharp/csharp.sln

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio Version 17
VisualStudioVersion = 17.5.2.0
MinimumVisualStudioVersion = 10.0.40219.1
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Supertonic", "Supertonic.csproj", "{869BE631-3CAF-8F33-CD9A-3A5788517967}"
EndProject
Global
	GlobalSection(SolutionConfigurationPlatforms) = preSolution
		Debug|Any CPU = Debug|Any CPU
		Release|Any CPU = Release|Any CPU
	EndGlobalSection
	GlobalSection(ProjectConfigurationPlatforms) = postSolution
		{869BE631-3CAF-8F33-CD9A-3A5788517967}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
		{869BE631-3CAF-8F33-CD9A-3A5788517967}.Debug|Any CPU.Build.0 = Debug|Any CPU
		{869BE631-3CAF-8F33-CD9A-3A5788517967}.Release|Any CPU.ActiveCfg = Release|Any CPU
		{869BE631-3CAF-8F33-CD9A-3A5788517967}.Release|Any CPU.Build.0 = Release|Any CPU
	EndGlobalSection
	GlobalSection(SolutionProperties) = preSolution
		HideSolutionNode = FALSE
	EndGlobalSection
	GlobalSection(ExtensibilityGlobals) = postSolution
		SolutionGuid = {2726B6AA-94CF-4D70-899D-0356CF025555}
	EndGlobalSection
EndGlobal

flutter/.gitignore

# Miscellaneous
*.class
*.log
*.pyc
*.swp
.DS_Store
.atom/
.build/
.buildlog/
.history
.svn/
.swiftpm/
migrate_working_dir/

# IntelliJ related
*.iml
*.ipr
*.iws
.idea/

# The .vscode folder contains launch configuration and tasks you configure in
# VS Code which you may wish to be included in version control, so this line
# is commented out by default.
#.vscode/

# Flutter/Dart/Pub related
**/doc/api/
**/ios/Flutter/.last_build_id
.dart_tool/
.flutter-plugins-dependencies
.pub-cache/
.pub/
/build/
/coverage/

# Symbolication related
app.*.symbols

# Obfuscation related
app.*.map.json

# Android Studio will place build artifacts here
/android/app/debug
/android/app/profile
/android/app/release

flutter/.metadata

# This file tracks properties of this Flutter project.
# Used by Flutter tool to assess capabilities and perform upgrades etc.
#
# This file should be version controlled and should not be manually edited.

version:
  revision: "19074d12f7eaf6a8180cd4036a430c1d76de904e"
  channel: "stable"

project_type: app

# Tracks metadata for the flutter migrate command
migration:
  platforms:
    - platform: root
      create_revision: 19074d12f7eaf6a8180cd4036a430c1d76de904e
      base_revision: 19074d12f7eaf6a8180cd4036a430c1d76de904e
    - platform: macos
      create_revision: 19074d12f7eaf6a8180cd4036a430c1d76de904e
      base_revision: 19074d12f7eaf6a8180cd4036a430c1d76de904e

# User provided section

# List of Local paths (relative to this file) that should be
# ignored by the migrate tool.
#
# Files that are not part of the templates will be ignored by default.
unmanaged_files:
  - 'lib/main.dart'
  - 'ios/Runner.xcodeproj/project.pbxproj'
38
flutter/README.md
Normal file
@@ -0,0 +1,38 @@
# Supertonic Flutter Example

This example demonstrates how to use Supertonic 2 in a Flutter application using ONNX Runtime.

> **Note:** This project uses the `flutter_onnxruntime` package ([https://pub.dev/packages/flutter_onnxruntime](https://pub.dev/packages/flutter_onnxruntime)). At the moment, only the macOS platform has been tested. Although the `flutter_onnxruntime` package supports several other platforms, they have not been tested in this project yet and may require additional verification.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) are now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)

**2025.11.23** - Added and tested macOS support.

## Multilingual Support

Supertonic 2 supports multiple languages. Select the appropriate language from the dropdown:
- **English (en)**: Default language
- **한국어 (ko)**: Korean
- **Español (es)**: Spanish
- **Português (pt)**: Portuguese
- **Français (fr)**: French

## Requirements

- Flutter SDK version ^3.5.0

## Running the Demo

```bash
flutter clean
flutter pub get
flutter run -d macos
```
28
flutter/analysis_options.yaml
Normal file
@@ -0,0 +1,28 @@
# This file configures the analyzer, which statically analyzes Dart code to
# check for errors, warnings, and lints.
#
# The issues identified by the analyzer are surfaced in the UI of Dart-enabled
# IDEs (https://dart.dev/tools#ides-and-editors). The analyzer can also be
# invoked from the command line by running `flutter analyze`.

# The following line activates a set of recommended lints for Flutter apps,
# packages, and plugins designed to encourage good coding practices.
include: package:flutter_lints/flutter.yaml

linter:
  # The lint rules applied to this project can be customized in the
  # section below to disable rules from the `package:flutter_lints/flutter.yaml`
  # included above or to enable additional rules. A list of all available lints
  # and their documentation is published at https://dart.dev/lints.
  #
  # Instead of disabling a lint rule for the entire project in the
  # section below, it can also be suppressed for a single line of code
  # or a specific dart file by using the `// ignore: name_of_lint` and
  # `// ignore_for_file: name_of_lint` syntax on the line or in the file
  # producing the lint.
  rules:
    # avoid_print: false  # Uncomment to disable the `avoid_print` rule
    # prefer_single_quotes: true  # Uncomment to enable the `prefer_single_quotes` rule

# Additional information about this file can be found at
# https://dart.dev/guides/language/analysis-options
695
flutter/lib/helper.dart
Normal file
@@ -0,0 +1,695 @@
import 'dart:io';
import 'dart:convert';
import 'dart:math' as math;
import 'dart:typed_data';
import 'package:flutter/services.dart' show rootBundle;
import 'package:flutter_onnxruntime/flutter_onnxruntime.dart';
import 'package:logger/logger.dart';
import 'package:path_provider/path_provider.dart';

final logger = Logger(
  printer: PrettyPrinter(methodCount: 0, errorMethodCount: 5, lineLength: 80),
);

// Available languages for multilingual TTS
const List<String> availableLangs = ['en', 'ko', 'es', 'pt', 'fr'];

bool isValidLang(String lang) => availableLangs.contains(lang);

// Hangul Jamo constants for NFKD decomposition
const int _hangulSyllableBase = 0xAC00;
const int _hangulSyllableEnd = 0xD7A3;
const int _leadingJamoBase = 0x1100;
const int _vowelJamoBase = 0x1161;
const int _trailingJamoBase = 0x11A7;
const int _vowelCount = 21;
const int _trailingCount = 28;

/// Decompose a Hangul syllable into Jamo (NFKD-like decomposition)
List<int> _decomposeHangulSyllable(int codePoint) {
  if (codePoint < _hangulSyllableBase || codePoint > _hangulSyllableEnd) {
    return [codePoint];
  }

  final syllableIndex = codePoint - _hangulSyllableBase;
  final leadingIndex = syllableIndex ~/ (_vowelCount * _trailingCount);
  final vowelIndex =
      (syllableIndex % (_vowelCount * _trailingCount)) ~/ _trailingCount;
  final trailingIndex = syllableIndex % _trailingCount;

  final result = <int>[
    _leadingJamoBase + leadingIndex,
    _vowelJamoBase + vowelIndex,
  ];

  if (trailingIndex > 0) {
    result.add(_trailingJamoBase + trailingIndex);
  }

  return result;
}
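The index arithmetic in `_decomposeHangulSyllable` is the standard Unicode Hangul decomposition: each precomposed syllable encodes a (leading, vowel, trailing) Jamo triple. A minimal Python sketch of the same math (illustration only, not part of the app):

```python
# Standard Unicode Hangul syllable -> Jamo decomposition, mirroring
# the constants and index math in _decomposeHangulSyllable.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def decompose_hangul(cp: int) -> list[int]:
    if not (S_BASE <= cp <= 0xD7A3):
        return [cp]  # not a precomposed syllable: pass through unchanged
    idx = cp - S_BASE
    lead = idx // (V_COUNT * T_COUNT)
    vowel = (idx % (V_COUNT * T_COUNT)) // T_COUNT
    trail = idx % T_COUNT
    out = [L_BASE + lead, V_BASE + vowel]
    if trail > 0:
        out.append(T_BASE + trail)  # trailing consonant only if present
    return out

# '한' (U+D55C) decomposes into the Jamo ㅎ + ㅏ + ㄴ
print([hex(c) for c in decompose_hangul(0xD55C)])
```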
/// Common Latin character decompositions (NFKD) for es, pt, fr
const Map<int, List<int>> _latinDecompositions = {
  // Uppercase with acute accent
  0x00C1: [0x0041, 0x0301], // Á → A + ́
  0x00C9: [0x0045, 0x0301], // É → E + ́
  0x00CD: [0x0049, 0x0301], // Í → I + ́
  0x00D3: [0x004F, 0x0301], // Ó → O + ́
  0x00DA: [0x0055, 0x0301], // Ú → U + ́
  // Lowercase with acute accent
  0x00E1: [0x0061, 0x0301], // á → a + ́
  0x00E9: [0x0065, 0x0301], // é → e + ́
  0x00ED: [0x0069, 0x0301], // í → i + ́
  0x00F3: [0x006F, 0x0301], // ó → o + ́
  0x00FA: [0x0075, 0x0301], // ú → u + ́
  // Grave accent
  0x00C0: [0x0041, 0x0300], // À → A + ̀
  0x00C8: [0x0045, 0x0300], // È → E + ̀
  0x00CC: [0x0049, 0x0300], // Ì → I + ̀
  0x00D2: [0x004F, 0x0300], // Ò → O + ̀
  0x00D9: [0x0055, 0x0300], // Ù → U + ̀
  0x00E0: [0x0061, 0x0300], // à → a + ̀
  0x00E8: [0x0065, 0x0300], // è → e + ̀
  0x00EC: [0x0069, 0x0300], // ì → i + ̀
  0x00F2: [0x006F, 0x0300], // ò → o + ̀
  0x00F9: [0x0075, 0x0300], // ù → u + ̀
  // Circumflex
  0x00C2: [0x0041, 0x0302], // Â → A + ̂
  0x00CA: [0x0045, 0x0302], // Ê → E + ̂
  0x00CE: [0x0049, 0x0302], // Î → I + ̂
  0x00D4: [0x004F, 0x0302], // Ô → O + ̂
  0x00DB: [0x0055, 0x0302], // Û → U + ̂
  0x00E2: [0x0061, 0x0302], // â → a + ̂
  0x00EA: [0x0065, 0x0302], // ê → e + ̂
  0x00EE: [0x0069, 0x0302], // î → i + ̂
  0x00F4: [0x006F, 0x0302], // ô → o + ̂
  0x00FB: [0x0075, 0x0302], // û → u + ̂
  // Tilde
  0x00C3: [0x0041, 0x0303], // Ã → A + ̃
  0x00D1: [0x004E, 0x0303], // Ñ → N + ̃
  0x00D5: [0x004F, 0x0303], // Õ → O + ̃
  0x00E3: [0x0061, 0x0303], // ã → a + ̃
  0x00F1: [0x006E, 0x0303], // ñ → n + ̃
  0x00F5: [0x006F, 0x0303], // õ → o + ̃
  // Diaeresis/Umlaut
  0x00C4: [0x0041, 0x0308], // Ä → A + ̈
  0x00CB: [0x0045, 0x0308], // Ë → E + ̈
  0x00CF: [0x0049, 0x0308], // Ï → I + ̈
  0x00D6: [0x004F, 0x0308], // Ö → O + ̈
  0x00DC: [0x0055, 0x0308], // Ü → U + ̈
  0x00E4: [0x0061, 0x0308], // ä → a + ̈
  0x00EB: [0x0065, 0x0308], // ë → e + ̈
  0x00EF: [0x0069, 0x0308], // ï → i + ̈
  0x00F6: [0x006F, 0x0308], // ö → o + ̈
  0x00FC: [0x0075, 0x0308], // ü → u + ̈
  // Cedilla
  0x00C7: [0x0043, 0x0327], // Ç → C + ̧
  0x00E7: [0x0063, 0x0327], // ç → c + ̧
};

/// Apply NFKD-like decomposition (Hangul + Latin accented characters)
String _applyNfkdDecomposition(String text) {
  final result = <int>[];
  for (final codePoint in text.runes) {
    // Check Hangul first
    if (codePoint >= _hangulSyllableBase && codePoint <= _hangulSyllableEnd) {
      result.addAll(_decomposeHangulSyllable(codePoint));
    }
    // Check Latin decomposition
    else if (_latinDecompositions.containsKey(codePoint)) {
      result.addAll(_latinDecompositions[codePoint]!);
    }
    // Keep as-is
    else {
      result.add(codePoint);
    }
  }
  return String.fromCharCodes(result);
}

String preprocessText(String text, String lang) {
  // Apply NFKD-like decomposition (especially for Hangul syllables → Jamo)
  text = _applyNfkdDecomposition(text);

  // Remove emojis
  text = text.replaceAll(
      RegExp(
        r'[\u{1F600}-\u{1F64F}]|[\u{1F300}-\u{1F5FF}]|[\u{1F680}-\u{1F6FF}]|'
        r'[\u{1F700}-\u{1F77F}]|[\u{1F780}-\u{1F7FF}]|[\u{1F800}-\u{1F8FF}]|'
        r'[\u{1F900}-\u{1F9FF}]|[\u{1FA00}-\u{1FA6F}]|[\u{1FA70}-\u{1FAFF}]|'
        r'[\u{2600}-\u{26FF}]|[\u{2700}-\u{27BF}]|[\u{1F1E6}-\u{1F1FF}]',
        unicode: true,
      ),
      '');

  // Replace various dashes and symbols
  const replacements = {
    '–': '-',
    '‑': '-',
    '—': '-',
    '_': ' ',
    '\u201C': '"',
    '\u201D': '"',
    '\u2018': "'",
    '\u2019': "'",
    '´': "'",
    '`': "'",
    '[': ' ',
    ']': ' ',
    '|': ' ',
    '/': ' ',
    '#': ' ',
    '→': ' ',
    '←': ' ',
  };
  for (final entry in replacements.entries) {
    text = text.replaceAll(entry.key, entry.value);
  }

  // Remove special symbols
  text = text.replaceAll(RegExp(r'[♥☆♡©\\]'), '');

  // Replace known expressions
  text = text.replaceAll('@', ' at ');
  text = text.replaceAll('e.g.,', 'for example, ');
  text = text.replaceAll('i.e.,', 'that is, ');

  // Fix spacing around punctuation
  text = text.replaceAll(' ,', ',');
  text = text.replaceAll(' .', '.');
  text = text.replaceAll(' !', '!');
  text = text.replaceAll(' ?', '?');
  text = text.replaceAll(' ;', ';');
  text = text.replaceAll(' :', ':');
  text = text.replaceAll(" '", "'");

  // Remove duplicate quotes
  while (text.contains('""')) text = text.replaceAll('""', '"');
  while (text.contains("''")) text = text.replaceAll("''", "'");
  while (text.contains('``')) text = text.replaceAll('``', '`');

  // Remove extra spaces
  text = text.replaceAll(RegExp(r'\s+'), ' ').trim();

  // Add period if needed
  if (text.isNotEmpty &&
      !RegExp(r'[.!?;:,\x27\x22\u2018\u2019)\]}…。」』】〉》›»]$').hasMatch(text)) {
    text += '.';
  }

  // Validate language
  if (!isValidLang(lang)) {
    throw ArgumentError(
        'Invalid language: $lang. Available: ${availableLangs.join(", ")}');
  }

  // Wrap text with language tags
  text = '<$lang>$text</$lang>';

  return text;
}
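The tail of `preprocessText` can be summarized as: collapse whitespace, guarantee terminal punctuation, and wrap the result in language tags. A simplified Python sketch of just those last steps (the real Dart function also strips emojis and normalizes quotes and dashes):

```python
import re

# Simplified sketch of the final steps of preprocessText: whitespace
# collapse, terminal punctuation, and <lang>...</lang> wrapping.
AVAILABLE_LANGS = {'en', 'ko', 'es', 'pt', 'fr'}

def wrap_for_tts(text: str, lang: str) -> str:
    if lang not in AVAILABLE_LANGS:
        raise ValueError(f'Invalid language: {lang}')
    text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace runs
    if text and text[-1] not in '.!?;:,\'")]}':
        text += '.'                            # add a period if none present
    return f'<{lang}>{text}</{lang}>'

print(wrap_for_tts('Hello   world', 'en'))  # <en>Hello world.</en>
```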
class UnicodeProcessor {
  final Map<int, int> indexer;

  UnicodeProcessor._(this.indexer);

  static Future<UnicodeProcessor> load(String path) async {
    final json = jsonDecode(
      path.startsWith('assets/')
          ? await rootBundle.loadString(path)
          : File(path).readAsStringSync(),
    );

    final indexer = json is List
        ? {
            for (var i = 0; i < json.length; i++)
              if (json[i] is int && json[i] >= 0) i: json[i] as int
          }
        : (json as Map<String, dynamic>)
            .map((k, v) => MapEntry(int.parse(k), v as int));

    return UnicodeProcessor._(indexer);
  }

  Map<String, dynamic> call(List<String> textList, List<String> langList) {
    // Preprocess texts with language tags
    final processedTexts = <String>[];
    for (var i = 0; i < textList.length; i++) {
      processedTexts.add(preprocessText(textList[i], langList[i]));
    }

    final lengths = processedTexts.map((t) => t.runes.length).toList();
    final maxLen = lengths.reduce(math.max);

    final textIds = processedTexts.map((text) {
      final row = List<int>.filled(maxLen, 0);
      final runes = text.runes.toList();
      for (var i = 0; i < runes.length; i++) {
        row[i] = indexer[runes[i]] ?? 0;
      }
      return row;
    }).toList();

    return {'textIds': textIds, 'textMask': _lengthToMask(lengths)};
  }

  List<List<List<double>>> _lengthToMask(List<int> lengths, [int? maxLen]) {
    maxLen ??= lengths.reduce(math.max);
    return lengths
        .map((len) => [List.generate(maxLen!, (i) => i < len ? 1.0 : 0.0)])
        .toList();
  }
}
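`_lengthToMask` builds the padding mask consumed by the ONNX models: for each sequence of length L in a padded batch, a `[1, maxLen]` row of 1.0s up to L and 0.0s after. A Python sketch of the same shape logic:

```python
# Mirror of _lengthToMask: per-sequence [1, maxLen] padding masks,
# 1.0 for valid positions and 0.0 for padding.
def length_to_mask(lengths: list[int]) -> list[list[list[float]]]:
    max_len = max(lengths)
    return [[[1.0 if i < length else 0.0 for i in range(max_len)]]
            for length in lengths]

print(length_to_mask([2, 3]))
```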
class Style {
  final OrtValue ttl, dp;
  final List<int> ttlShape, dpShape;
  Style(this.ttl, this.dp, this.ttlShape, this.dpShape);
}

class TextToSpeech {
  final Map<String, dynamic> cfgs;
  final UnicodeProcessor textProcessor;
  final OrtSession dpOrt, textEncOrt, vectorEstOrt, vocoderOrt;
  final int sampleRate, baseChunkSize, chunkCompressFactor, ldim;

  TextToSpeech(this.cfgs, this.textProcessor, this.dpOrt, this.textEncOrt,
      this.vectorEstOrt, this.vocoderOrt)
      : sampleRate = cfgs['ae']['sample_rate'],
        baseChunkSize = cfgs['ae']['base_chunk_size'],
        chunkCompressFactor = cfgs['ttl']['chunk_compress_factor'],
        ldim = cfgs['ttl']['latent_dim'];

  Future<Map<String, dynamic>> call(
      String text, String lang, Style style, int totalStep,
      {double speed = 1.05, double silenceDuration = 0.3}) async {
    final maxLen = lang == 'ko' ? 120 : 300;
    final chunks = _chunkText(text, maxLen: maxLen);
    final langList = List.filled(chunks.length, lang);
    List<double>? wavCat;
    double durCat = 0;

    for (var i = 0; i < chunks.length; i++) {
      final result = await _infer([chunks[i]], [langList[i]], style, totalStep,
          speed: speed);
      final wav = _safeCast<double>(result['wav']);
      final duration = _safeCast<double>(result['duration']);

      if (wavCat == null) {
        wavCat = wav;
        durCat = duration[0];
      } else {
        wavCat = [
          ...wavCat,
          ...List<double>.filled((silenceDuration * sampleRate).floor(), 0.0),
          ...wav
        ];
        durCat += duration[0] + silenceDuration;
      }
    }

    return {
      'wav': wavCat,
      'duration': [durCat]
    };
  }

  Future<Map<String, dynamic>> _infer(
      List<String> textList, List<String> langList, Style style, int totalStep,
      {double speed = 1.05}) async {
    final bsz = textList.length;
    final result = textProcessor.call(textList, langList);

    final textIdsRaw = result['textIds'];
    final textIds = textIdsRaw is List<List<int>>
        ? textIdsRaw
        : (textIdsRaw as List).map((row) => (row as List).cast<int>()).toList();

    final textMaskRaw = result['textMask'];
    final textMask = textMaskRaw is List<List<List<double>>>
        ? textMaskRaw
        : (textMaskRaw as List)
            .map((batch) => (batch as List)
                .map((row) => (row as List).cast<double>())
                .toList())
            .toList();

    final textIdsShape = [bsz, textIds[0].length];
    final textMaskShape = [bsz, 1, textMask[0][0].length];
    final textMaskTensor = await _toTensor(textMask, textMaskShape);

    final dpResult = await dpOrt.run({
      'text_ids': await _intToTensor(textIds, textIdsShape),
      'style_dp': style.dp,
      'text_mask': textMaskTensor,
    });
    final durOnnx = _safeCast<double>(await dpResult.values.first.asList());
    final scaledDur = durOnnx.map((d) => d / speed).toList();

    final textEncResult = await textEncOrt.run({
      'text_ids': await _intToTensor(textIds, textIdsShape),
      'style_ttl': style.ttl,
      'text_mask': textMaskTensor,
    });

    final latentData = _sampleNoisyLatent(scaledDur);
    final noisyLatentRaw = latentData['noisyLatent'];
    var noisyLatent = noisyLatentRaw is List<List<List<double>>>
        ? noisyLatentRaw
        : (noisyLatentRaw as List)
            .map((batch) => (batch as List)
                .map((row) => (row as List).cast<double>())
                .toList())
            .toList();

    final latentMaskRaw = latentData['latentMask'];
    final latentMask = latentMaskRaw is List<List<List<double>>>
        ? latentMaskRaw
        : (latentMaskRaw as List)
            .map((batch) => (batch as List)
                .map((row) => (row as List).cast<double>())
                .toList())
            .toList();

    final latentShape = [bsz, noisyLatent[0].length, noisyLatent[0][0].length];
    final latentMaskTensor =
        await _toTensor(latentMask, [bsz, 1, latentMask[0][0].length]);

    final totalStepTensor =
        await _scalarToTensor(List.filled(bsz, totalStep.toDouble()), [bsz]);

    // Denoising loop
    for (var step = 0; step < totalStep; step++) {
      final result = await vectorEstOrt.run({
        'noisy_latent': await _toTensor(noisyLatent, latentShape),
        'text_emb': textEncResult.values.first,
        'style_ttl': style.ttl,
        'text_mask': textMaskTensor,
        'latent_mask': latentMaskTensor,
        'total_step': totalStepTensor,
        'current_step':
            await _scalarToTensor(List.filled(bsz, step.toDouble()), [bsz]),
      });

      final denoisedRaw = await result.values.first.asList();
      final denoised = denoisedRaw is List<double>
          ? denoisedRaw
          : _safeCast<double>(denoisedRaw);
      var idx = 0;
      for (var b = 0; b < noisyLatent.length; b++) {
        for (var d = 0; d < noisyLatent[b].length; d++) {
          for (var t = 0; t < noisyLatent[b][d].length; t++) {
            noisyLatent[b][d][t] = denoised[idx++];
          }
        }
      }
    }

    final vocoderResult = await vocoderOrt
        .run({'latent': await _toTensor(noisyLatent, latentShape)});
    final wavRaw = await vocoderResult.values.first.asList();
    final wav = wavRaw is List<double> ? wavRaw : _safeCast<double>(wavRaw);

    return {'wav': wav, 'duration': scaledDur};
  }
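The denoising loop in `_infer` has a simple control structure: the latent is fed through the vector estimator `totalStep` times, with the current step index as an input, and each output replaces the latent. A Python sketch of that loop structure, with `estimate` as a stand-in for the `vector_estimator` ONNX session:

```python
# Sketch of the denoising loop in _infer: the estimator output replaces
# the latent on each of total_step iterations; `estimate` stands in for
# the vector_estimator ONNX session (which also receives text/style/masks).
def denoise(latent, total_step, estimate):
    for step in range(total_step):
        latent = estimate(latent, step, total_step)
    return latent

# Toy estimator that halves every value each step (illustration only).
out = denoise([1.0, -2.0], 3, lambda x, step, total: [v / 2 for v in x])
print(out)  # [0.125, -0.25]
```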
  Map<String, dynamic> _sampleNoisyLatent(List<double> duration) {
    final wavLenMax = duration.reduce(math.max) * sampleRate;
    final wavLengths = duration.map((d) => (d * sampleRate).floor()).toList();
    final chunkSize = baseChunkSize * chunkCompressFactor;
    final latentLen = ((wavLenMax + chunkSize - 1) / chunkSize).floor();
    final latentDim = ldim * chunkCompressFactor;

    final random = math.Random();
    final noisyLatent = List.generate(
      duration.length,
      (_) => List.generate(
        latentDim,
        (_) => List.generate(latentLen, (_) {
          final u1 = math.max(1e-10, random.nextDouble());
          final u2 = random.nextDouble();
          return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2);
        }),
      ),
    );

    final latentMask = _getLatentMask(wavLengths);

    for (var b = 0; b < noisyLatent.length; b++) {
      for (var d = 0; d < noisyLatent[b].length; d++) {
        for (var t = 0; t < noisyLatent[b][d].length; t++) {
          noisyLatent[b][d][t] *= latentMask[b][0][t];
        }
      }
    }

    return {'noisyLatent': noisyLatent, 'latentMask': latentMask};
  }
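The inner sampler in `_sampleNoisyLatent` is the Box-Muller transform: two uniform draws produce one standard-normal sample, with a floor on the first draw to guard against `log(0)`. A Python sketch of the same transform:

```python
import math
import random

# Box-Muller transform as used in _sampleNoisyLatent: two uniforms in
# (0, 1) -> one standard-normal sample; 1e-10 floor guards log(0).
def gaussian(rng: random.Random) -> float:
    u1 = max(1e-10, rng.random())
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

rng = random.Random(0)
samples = [gaussian(rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 3))  # close to 0 for standard-normal noise
```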
  List<List<List<double>>> _getLatentMask(List<int> wavLengths) {
    final latentSize = baseChunkSize * chunkCompressFactor;
    final latentLengths = wavLengths
        .map((len) => ((len + latentSize - 1) / latentSize).floor())
        .toList();
    final maxLen = latentLengths.reduce(math.max);
    return latentLengths
        .map((len) => [List.generate(maxLen, (i) => i < len ? 1.0 : 0.0)])
        .toList();
  }

  List<String> _chunkText(String text, {int maxLen = 300}) {
    final paragraphs = text
        .trim()
        .split(RegExp(r'\n\s*\n+'))
        .where((p) => p.trim().isNotEmpty)
        .toList();

    final chunks = <String>[];
    for (var paragraph in paragraphs) {
      paragraph = paragraph.trim();
      if (paragraph.isEmpty) continue;

      final sentences = paragraph.split(RegExp(
          r'(?<!Mr\.|Mrs\.|Ms\.|Dr\.|Prof\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+'));

      var currentChunk = '';
      for (final sentence in sentences) {
        if (currentChunk.length + sentence.length + 1 <= maxLen) {
          currentChunk += (currentChunk.isNotEmpty ? ' ' : '') + sentence;
        } else {
          if (currentChunk.isNotEmpty) chunks.add(currentChunk.trim());
          currentChunk = sentence;
        }
      }
      if (currentChunk.isNotEmpty) chunks.add(currentChunk.trim());
    }

    return chunks;
  }
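`_chunkText` splits text into sentences and greedily packs consecutive sentences into chunks no longer than `maxLen` characters, so each TTS call stays within the model's text budget. A simplified Python sketch of the packing (the Dart regex additionally guards common abbreviations like "Mr." and single-initial periods):

```python
import re

# Simplified sketch of _chunkText's greedy sentence packing: split on
# sentence-final punctuation, then accumulate sentences up to max_len.
def chunk_text(paragraph: str, max_len: int = 300) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    chunks, current = [], ''
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= max_len:
            current += (' ' if current else '') + sentence
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(chunk_text('One fish. Two fish. Red fish.', max_len=20))
```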
  List<T> _safeCast<T>(dynamic raw) {
    if (raw is List<T>) return raw;
    if (raw is List) {
      if (raw.isNotEmpty && raw.first is List) {
        return _flattenList<T>(raw);
      }
      if (T == double) {
        return raw
            .map((e) => e is num ? e.toDouble() : double.parse(e.toString()))
            .toList() as List<T>;
      }
      return raw.cast<T>();
    }
    throw Exception('Cannot convert $raw to List<$T>');
  }

  List<T> _flattenList<T>(dynamic list) {
    if (list is List) {
      return list.expand((e) => _flattenList<T>(e)).toList();
    }
    if (T == double && list is num) {
      return [list.toDouble()] as List<T>;
    }
    return [list as T];
  }

  Future<OrtValue> _toTensor(dynamic array, List<int> dims) async {
    final flat = _flattenList<double>(array);
    return await OrtValue.fromList(Float32List.fromList(flat), dims);
  }

  Future<OrtValue> _scalarToTensor(List<double> array, List<int> dims) async {
    return await OrtValue.fromList(Float32List.fromList(array), dims);
  }

  Future<OrtValue> _intToTensor(List<List<int>> array, List<int> dims) async {
    final flat = array.expand((row) => row).toList();
    return await OrtValue.fromList(Int64List.fromList(flat), dims);
  }
}

Future<TextToSpeech> loadTextToSpeech(String onnxDir,
    {bool useGpu = false}) async {
  if (useGpu) throw Exception('GPU mode not supported yet');

  logger.i('Loading TTS models from $onnxDir');

  final cfgs = await _loadCfgs(onnxDir);
  final sessions = await _loadOnnxAll(onnxDir);
  final textProcessor =
      await UnicodeProcessor.load('$onnxDir/unicode_indexer.json');

  logger.i('TTS models loaded successfully');

  return TextToSpeech(
    cfgs,
    textProcessor,
    sessions['dpOrt']!,
    sessions['textEncOrt']!,
    sessions['vectorEstOrt']!,
    sessions['vocoderOrt']!,
  );
}

Future<Style> loadVoiceStyle(List<String> paths) async {
  final bsz = paths.length;

  final firstJson = jsonDecode(
    paths[0].startsWith('assets/')
        ? await rootBundle.loadString(paths[0])
        : File(paths[0]).readAsStringSync(),
  );

  final ttlDims = List<int>.from(firstJson['style_ttl']['dims']);
  final dpDims = List<int>.from(firstJson['style_dp']['dims']);

  final ttlFlat = Float32List(bsz * ttlDims[1] * ttlDims[2]);
  final dpFlat = Float32List(bsz * dpDims[1] * dpDims[2]);

  for (var i = 0; i < bsz; i++) {
    final json = jsonDecode(
      paths[i].startsWith('assets/')
          ? await rootBundle.loadString(paths[i])
          : File(paths[i]).readAsStringSync(),
    );

    final ttlData = _flattenToDouble(json['style_ttl']['data']);
    final dpData = _flattenToDouble(json['style_dp']['data']);

    ttlFlat.setRange(i * ttlDims[1] * ttlDims[2],
        (i + 1) * ttlDims[1] * ttlDims[2], ttlData);
    dpFlat.setRange(
        i * dpDims[1] * dpDims[2], (i + 1) * dpDims[1] * dpDims[2], dpData);
  }

  final ttlShape = [bsz, ttlDims[1], ttlDims[2]];
  final dpShape = [bsz, dpDims[1], dpDims[2]];

  return Style(
    await OrtValue.fromList(ttlFlat, ttlShape),
    await OrtValue.fromList(dpFlat, dpShape),
    ttlShape,
    dpShape,
  );
}

Future<Map<String, dynamic>> _loadCfgs(String onnxDir) async {
  final path = '$onnxDir/tts.json';
  final json = jsonDecode(await rootBundle.loadString(path));
  return json as Map<String, dynamic>;
}

Future<String> copyModelToFile(String path) async {
  final byteData = await rootBundle.load(path);
  final tempDir = await getApplicationCacheDirectory();
  final modelPath = '${tempDir.path}/${path.split("/").last}';

  final file = File(modelPath);
  await file.writeAsBytes(byteData.buffer.asUint8List());
  return modelPath;
}

Future<Map<String, OrtSession>> _loadOnnxAll(String dir) async {
  final ort = OnnxRuntime();
  final models = [
    'duration_predictor',
    'text_encoder',
    'vector_estimator',
    'vocoder'
  ];

  final sessions = await Future.wait(models.map((name) async {
    final path = await copyModelToFile('$dir/$name.onnx');
    logger.d('Loading $name.onnx');
    return ort.createSessionFromAsset(path);
  }));

  return {
    'dpOrt': sessions[0],
    'textEncOrt': sessions[1],
    'vectorEstOrt': sessions[2],
    'vocoderOrt': sessions[3],
  };
}

List<double> _flattenToDouble(dynamic list) {
  if (list is List) return list.expand((e) => _flattenToDouble(e)).toList();
  return [list is num ? list.toDouble() : double.parse(list.toString())];
}

void writeWavFile(String filename, List<double> audioData, int sampleRate) {
  const numChannels = 1;
  const bitsPerSample = 16;
  final dataSize = audioData.length * 2;

  final buffer = ByteData(44 + dataSize);
  var offset = 0;

  // RIFF header
  for (var byte in [0x52, 0x49, 0x46, 0x46]) {
    buffer.setUint8(offset++, byte);
  }
  buffer.setUint32(offset, 36 + dataSize, Endian.little);
  offset += 4;

  // WAVE
  for (var byte in [0x57, 0x41, 0x56, 0x45]) {
    buffer.setUint8(offset++, byte);
  }

  // fmt chunk
  for (var byte in [0x66, 0x6D, 0x74, 0x20]) {
    buffer.setUint8(offset++, byte);
  }
  buffer.setUint32(offset, 16, Endian.little);
  offset += 4;
  buffer.setUint16(offset, 1, Endian.little);
  offset += 2;
  buffer.setUint16(offset, numChannels, Endian.little);
  offset += 2;
  buffer.setUint32(offset, sampleRate, Endian.little);
  offset += 4;
  buffer.setUint32(offset, sampleRate * numChannels * 2, Endian.little);
  offset += 4;
  buffer.setUint16(offset, numChannels * 2, Endian.little);
  offset += 2;
  buffer.setUint16(offset, bitsPerSample, Endian.little);
  offset += 2;

  // data chunk
  for (var byte in [0x64, 0x61, 0x74, 0x61]) {
    buffer.setUint8(offset++, byte);
  }
  buffer.setUint32(offset, dataSize, Endian.little);
  offset += 4;

  // Write audio samples
  for (var i = 0; i < audioData.length; i++) {
    final sample = (audioData[i].clamp(-1.0, 1.0) * 32767).round();
    buffer.setInt16(offset + i * 2, sample, Endian.little);
  }

  File(filename).writeAsBytesSync(buffer.buffer.asUint8List());
}
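`writeWavFile` emits the canonical 44-byte WAV header (RIFF/WAVE container, 16-byte PCM `fmt ` chunk, then the `data` chunk) followed by little-endian 16-bit samples. A Python sketch of just the header layout:

```python
import struct

# Canonical 44-byte WAV header, matching the layout written by
# writeWavFile: RIFF size, PCM fmt chunk, then the data chunk header.
def wav_header(num_samples: int, sample_rate: int,
               channels: int = 1, bits: int = 16) -> bytes:
    data_size = num_samples * channels * bits // 8
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return (b'RIFF' + struct.pack('<I', 36 + data_size) + b'WAVE'
            + b'fmt ' + struct.pack('<IHHIIHH', 16, 1, channels,
                                    sample_rate, byte_rate, block_align, bits)
            + b'data' + struct.pack('<I', data_size))

header = wav_header(44100, 44100)  # one second of 16-bit mono audio
print(len(header))  # 44
```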
391
flutter/lib/main.dart
Normal file
@@ -0,0 +1,391 @@
import 'dart:io';
import 'package:flutter/material.dart';
import 'package:just_audio/just_audio.dart';
import 'package:path_provider/path_provider.dart';
import 'package:flutter_sdk/helper.dart';

void main() {
  runApp(const SupertonicApp());
}

class SupertonicApp extends StatelessWidget {
  const SupertonicApp({super.key});

  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      title: 'Supertonic 2',
      theme: ThemeData(
        colorScheme: ColorScheme.fromSeed(seedColor: Colors.deepPurple),
        useMaterial3: true,
      ),
      home: const TTSPage(),
    );
  }
}

class TTSPage extends StatefulWidget {
  const TTSPage({super.key});

  @override
  State<TTSPage> createState() => _TTSPageState();
}

class _TTSPageState extends State<TTSPage> {
  final TextEditingController _textController = TextEditingController(
    text: 'Hello, this is a text to speech example.',
  );
  final AudioPlayer _audioPlayer = AudioPlayer();

  TextToSpeech? _textToSpeech;
  Style? _style;
  bool _isLoading = false;
  bool _isGenerating = false;
  String _status = 'Not initialized';
  int _totalSteps = 5;
  double _speed = 1.05;
  String _selectedLang = 'en';
  bool _isPlaying = false;
  String? _lastGeneratedFilePath;

  @override
  void initState() {
    super.initState();
    _loadModels();
    _setupAudioPlayerListeners();
  }

  void _setupAudioPlayerListeners() {
    _audioPlayer.playerStateStream.listen((state) {
      if (!mounted) return;

      setState(() {
        _isPlaying = state.playing;

        if (state.processingState == ProcessingState.completed) {
          _isPlaying = false;
          _status = 'Ready';
        } else if (state.processingState == ProcessingState.loading) {
          _status = 'Loading audio...';
        } else if (state.processingState == ProcessingState.buffering) {
          _status = 'Buffering...';
        }
      });
    });
  }

  Future<void> _loadModels() async {
    setState(() {
      _isLoading = true;
      _status = 'Loading models...';
    });

    try {
      _textToSpeech = await loadTextToSpeech('assets/onnx', useGpu: false);
      _style = await loadVoiceStyle(['assets/voice_styles/M1.json']);

      setState(() {
        _isLoading = false;
        _status = 'Ready';
      });
    } catch (e, stackTrace) {
      logger.e('Error loading models', error: e, stackTrace: stackTrace);
      setState(() {
        _isLoading = false;
        _status = 'Error: $e';
      });
    }
  }

  Future<void> _generateSpeech() async {
    if (_textToSpeech == null || _style == null) {
      setState(() => _status = 'Models not loaded yet');
      return;
    }

    if (_textController.text.trim().isEmpty) {
      setState(() => _status = 'Please enter some text');
      return;
    }

    setState(() {
      _isGenerating = true;
      _status = 'Generating speech...';
    });

    List<double>? wav;
    List<double>? duration;

    // Step 1: Generate speech
    try {
      final result = await _textToSpeech!.call(
        _textController.text,
        _selectedLang,
        _style!,
        _totalSteps,
        speed: _speed,
      );

      wav = result['wav'] is List<double>
          ? result['wav']
          : (result['wav'] as List).cast<double>();
      duration = result['duration'] is List<double>
          ? result['duration']
          : (result['duration'] as List).cast<double>();
    } catch (e) {
      logger.e('Error generating speech', error: e);
      setState(() {
        _isGenerating = false;
        _status = 'Error generating speech: $e';
      });
      return;
    }

    // Step 2: Save to file and play
    try {
      final tempDir = await getTemporaryDirectory();
      final timestamp = DateTime.now().millisecondsSinceEpoch;
      final outputPath = '${tempDir.path}/speech_$timestamp.wav';

      writeWavFile(outputPath, wav!, _textToSpeech!.sampleRate);

      final file = File(outputPath);
      if (!file.existsSync()) {
        throw Exception('Failed to create WAV file');
      }

      final absolutePath = file.absolute.path;

      setState(() {
        _isGenerating = false;
        _status = 'Playing ${duration![0].toStringAsFixed(2)}s of audio...';
        _lastGeneratedFilePath = absolutePath;
      });

      logger.i('Audio saved to $absolutePath');

      final uri = Uri.file(absolutePath);
      await _audioPlayer.setAudioSource(AudioSource.uri(uri));
      await _audioPlayer.play();
    } catch (e) {
      logger.e('Error playing audio', error: e);
      setState(() {
        _isGenerating = false;
        _status = 'Error playing audio: $e';
      });
    }
  }

  Future<void> _downloadFile() async {
    if (_lastGeneratedFilePath == null) return;

    try {
      final sourceFile = File(_lastGeneratedFilePath!);
|
||||
if (!sourceFile.existsSync()) {
|
||||
setState(() => _status = 'Error: File no longer exists');
|
||||
return;
|
||||
}
|
||||
|
||||
final downloadsDir = await getDownloadsDirectory();
|
||||
if (downloadsDir == null) {
|
||||
setState(() => _status = 'Error: Could not access downloads folder');
|
||||
return;
|
||||
}
|
||||
|
||||
final timestamp = DateTime.now().millisecondsSinceEpoch;
|
||||
final downloadPath = '${downloadsDir.path}/speech_$timestamp.wav';
|
||||
|
||||
await sourceFile.copy(downloadPath);
|
||||
logger.i('File saved to $downloadPath');
|
||||
|
||||
setState(() => _status = 'File saved to: $downloadPath');
|
||||
} catch (e) {
|
||||
logger.e('Error downloading file', error: e);
|
||||
setState(() => _status = 'Error downloading file: $e');
|
||||
}
|
||||
}
|
||||
|
||||
@override
|
||||
Widget build(BuildContext context) {
|
||||
return Scaffold(
|
||||
appBar: AppBar(
|
||||
backgroundColor: Theme.of(context).colorScheme.inversePrimary,
|
||||
title: const Text('Supertonic 2'),
|
||||
),
|
||||
body: Padding(
|
||||
padding: const EdgeInsets.all(16.0),
|
||||
child: Column(
|
||||
crossAxisAlignment: CrossAxisAlignment.stretch,
|
||||
children: [
|
||||
// Status indicator
|
||||
Card(
|
||||
color: _isLoading || _isGenerating
|
||||
? Colors.orange.shade100
|
||||
: _status.startsWith('Error')
|
||||
? Colors.red.shade100
|
||||
: Colors.green.shade100,
|
||||
child: Padding(
|
||||
padding: const EdgeInsets.all(16.0),
|
||||
child: Row(
|
||||
children: [
|
||||
if (_isLoading || _isGenerating)
|
||||
const SizedBox(
|
||||
width: 20,
|
||||
height: 20,
|
||||
child: CircularProgressIndicator(strokeWidth: 2),
|
||||
),
|
||||
if (_isLoading || _isGenerating) const SizedBox(width: 12),
|
||||
Expanded(
|
||||
child:
|
||||
Text(_status, style: const TextStyle(fontSize: 16)),
|
||||
),
|
||||
],
|
||||
),
|
||||
),
|
||||
),
|
||||
const SizedBox(height: 24),
|
||||
|
||||
// Text input
|
||||
TextField(
|
||||
controller: _textController,
|
||||
maxLines: 5,
|
||||
decoration: const InputDecoration(
|
||||
labelText: 'Text to synthesize',
|
||||
border: OutlineInputBorder(),
|
||||
hintText: 'Enter the text you want to convert to speech...',
|
||||
),
|
||||
enabled: !_isLoading && !_isGenerating,
|
||||
),
|
||||
const SizedBox(height: 24),
|
||||
|
||||
// Parameters
|
||||
Text('Parameters', style: Theme.of(context).textTheme.titleMedium),
|
||||
const SizedBox(height: 12),
|
||||
|
||||
// Denoising steps slider
|
||||
Row(
|
||||
children: [
|
||||
const Expanded(flex: 2, child: Text('Denoising Steps:')),
|
||||
Expanded(
|
||||
flex: 3,
|
||||
child: Slider(
|
||||
value: _totalSteps.toDouble(),
|
||||
min: 1,
|
||||
max: 20,
|
||||
divisions: 19,
|
||||
label: _totalSteps.toString(),
|
||||
onChanged: _isLoading || _isGenerating
|
||||
? null
|
||||
: (value) =>
|
||||
setState(() => _totalSteps = value.toInt()),
|
||||
),
|
||||
),
|
||||
SizedBox(
|
||||
width: 40,
|
||||
child:
|
||||
Text(_totalSteps.toString(), textAlign: TextAlign.right),
|
||||
),
|
||||
],
|
||||
),
|
||||
|
||||
// Speed slider
|
||||
Row(
|
||||
children: [
|
||||
const Expanded(flex: 2, child: Text('Speed:')),
|
||||
Expanded(
|
||||
flex: 3,
|
||||
child: Slider(
|
||||
value: _speed,
|
||||
min: 0.5,
|
||||
max: 2.0,
|
||||
divisions: 30,
|
||||
label: _speed.toStringAsFixed(2),
|
||||
onChanged: _isLoading || _isGenerating
|
||||
? null
|
||||
: (value) => setState(() => _speed = value),
|
||||
),
|
||||
),
|
||||
SizedBox(
|
||||
width: 40,
|
||||
child: Text(_speed.toStringAsFixed(2),
|
||||
textAlign: TextAlign.right),
|
||||
),
|
||||
],
|
||||
),
|
||||
const SizedBox(height: 12),
|
||||
|
||||
// Language selector
|
||||
Row(
|
||||
children: [
|
||||
const Expanded(flex: 2, child: Text('Language:')),
|
||||
Expanded(
|
||||
flex: 3,
|
||||
child: DropdownButton<String>(
|
||||
value: _selectedLang,
|
||||
isExpanded: true,
|
||||
items: const [
|
||||
DropdownMenuItem(value: 'en', child: Text('English')),
|
||||
DropdownMenuItem(value: 'ko', child: Text('한국어')),
|
||||
DropdownMenuItem(value: 'es', child: Text('Español')),
|
||||
DropdownMenuItem(value: 'pt', child: Text('Português')),
|
||||
DropdownMenuItem(value: 'fr', child: Text('Français')),
|
||||
],
|
||||
onChanged: _isLoading || _isGenerating
|
||||
? null
|
||||
: (value) => setState(() => _selectedLang = value!),
|
||||
),
|
||||
),
|
||||
],
|
||||
),
|
||||
const SizedBox(height: 24),
|
||||
|
||||
// Generate button
|
||||
ElevatedButton.icon(
|
||||
onPressed: _isLoading || _isGenerating
|
||||
? null
|
||||
: _isPlaying
|
||||
? () async {
|
||||
await _audioPlayer.stop();
|
||||
setState(() => _status = 'Ready');
|
||||
}
|
||||
: _generateSpeech,
|
||||
icon: Icon(_isPlaying ? Icons.stop : Icons.play_arrow),
|
||||
label: Text(
|
||||
_isGenerating
|
||||
? 'Generating...'
|
||||
: _isPlaying
|
||||
? 'Stop Playback'
|
||||
: 'Generate & Play Speech',
|
||||
style: const TextStyle(fontSize: 16),
|
||||
),
|
||||
style: ElevatedButton.styleFrom(
|
||||
padding: const EdgeInsets.symmetric(vertical: 16),
|
||||
),
|
||||
),
|
||||
|
||||
// Download button
|
||||
if (_lastGeneratedFilePath != null) ...[
|
||||
const SizedBox(height: 12),
|
||||
OutlinedButton.icon(
|
||||
onPressed: _isLoading || _isGenerating ? null : _downloadFile,
|
||||
icon: const Icon(Icons.download),
|
||||
label: const Text('Download WAV File',
|
||||
style: TextStyle(fontSize: 16)),
|
||||
style: OutlinedButton.styleFrom(
|
||||
padding: const EdgeInsets.symmetric(vertical: 16),
|
||||
),
|
||||
),
|
||||
],
|
||||
],
|
||||
),
|
||||
),
|
||||
);
|
||||
}
|
||||
|
||||
@override
|
||||
void dispose() {
|
||||
_textController.dispose();
|
||||
_audioPlayer.dispose();
|
||||
super.dispose();
|
||||
}
|
||||
}
|
||||
7
flutter/macos/.gitignore
vendored
Normal file
@@ -0,0 +1,7 @@
# Flutter-related
**/Flutter/ephemeral/
**/Pods/

# Xcode-related
**/dgph
**/xcuserdata/
2
flutter/macos/Flutter/Flutter-Debug.xcconfig
Normal file
@@ -0,0 +1,2 @@
#include? "Pods/Target Support Files/Pods-Runner/Pods-Runner.debug.xcconfig"
#include "ephemeral/Flutter-Generated.xcconfig"
2
flutter/macos/Flutter/Flutter-Release.xcconfig
Normal file
@@ -0,0 +1,2 @@
#include? "Pods/Target Support Files/Pods-Runner/Pods-Runner.release.xcconfig"
#include "ephemeral/Flutter-Generated.xcconfig"
16
flutter/macos/Flutter/GeneratedPluginRegistrant.swift
Normal file
@@ -0,0 +1,16 @@
//
//  Generated file. Do not edit.
//

import FlutterMacOS
import Foundation

import audio_session
import flutter_onnxruntime
import just_audio

func RegisterGeneratedPlugins(registry: FlutterPluginRegistry) {
  AudioSessionPlugin.register(with: registry.registrar(forPlugin: "AudioSessionPlugin"))
  FlutterOnnxruntimePlugin.register(with: registry.registrar(forPlugin: "FlutterOnnxruntimePlugin"))
  JustAudioPlugin.register(with: registry.registrar(forPlugin: "JustAudioPlugin"))
}
45
flutter/macos/Podfile
Normal file
@@ -0,0 +1,45 @@
platform :osx, '14.0'

# CocoaPods analytics sends network stats synchronously affecting flutter build latency.
ENV['COCOAPODS_DISABLE_STATS'] = 'true'

project 'Runner', {
  'Debug' => :debug,
  'Profile' => :release,
  'Release' => :release,
}

def flutter_root
  generated_xcode_build_settings_path = File.expand_path(File.join('..', 'Flutter', 'ephemeral', 'Flutter-Generated.xcconfig'), __FILE__)
  unless File.exist?(generated_xcode_build_settings_path)
    raise "#{generated_xcode_build_settings_path} must exist. If you're running pod install manually, make sure \"flutter pub get\" is executed first"
  end

  File.foreach(generated_xcode_build_settings_path) do |line|
    matches = line.match(/FLUTTER_ROOT\=(.*)/)
    return matches[1].strip if matches
  end
  raise "FLUTTER_ROOT not found in #{generated_xcode_build_settings_path}. Try deleting Flutter-Generated.xcconfig, then run \"flutter pub get\""
end

require File.expand_path(File.join('packages', 'flutter_tools', 'bin', 'podhelper'), flutter_root)

flutter_macos_podfile_setup

target 'Runner' do
  use_frameworks! :linkage => :static

  flutter_install_all_macos_pods File.dirname(File.realpath(__FILE__))
  target 'RunnerTests' do
    inherit! :search_paths
  end
end

post_install do |installer|
  installer.pods_project.targets.each do |target|
    flutter_additional_macos_build_settings(target)
    target.build_configurations.each do |config|
      config.build_settings['MACOSX_DEPLOYMENT_TARGET'] = '14.0'
    end
  end
end
54
flutter/macos/Podfile.lock
Normal file
@@ -0,0 +1,54 @@
PODS:
  - audio_session (0.0.1):
    - FlutterMacOS
  - flutter_onnxruntime (0.0.1):
    - FlutterMacOS
    - onnxruntime-objc (= 1.21.0)
  - FlutterMacOS (1.0.0)
  - just_audio (0.0.1):
    - Flutter
    - FlutterMacOS
  - objective_c (0.0.1):
    - FlutterMacOS
  - onnxruntime-c (1.21.0)
  - onnxruntime-objc (1.21.0):
    - onnxruntime-objc/Core (= 1.21.0)
  - onnxruntime-objc/Core (1.21.0):
    - onnxruntime-c (= 1.21.0)

DEPENDENCIES:
  - audio_session (from `Flutter/ephemeral/.symlinks/plugins/audio_session/macos`)
  - flutter_onnxruntime (from `Flutter/ephemeral/.symlinks/plugins/flutter_onnxruntime/macos`)
  - FlutterMacOS (from `Flutter/ephemeral`)
  - just_audio (from `Flutter/ephemeral/.symlinks/plugins/just_audio/darwin`)
  - objective_c (from `Flutter/ephemeral/.symlinks/plugins/objective_c/macos`)

SPEC REPOS:
  trunk:
    - onnxruntime-c
    - onnxruntime-objc

EXTERNAL SOURCES:
  audio_session:
    :path: Flutter/ephemeral/.symlinks/plugins/audio_session/macos
  flutter_onnxruntime:
    :path: Flutter/ephemeral/.symlinks/plugins/flutter_onnxruntime/macos
  FlutterMacOS:
    :path: Flutter/ephemeral
  just_audio:
    :path: Flutter/ephemeral/.symlinks/plugins/just_audio/darwin
  objective_c:
    :path: Flutter/ephemeral/.symlinks/plugins/objective_c/macos

SPEC CHECKSUMS:
  audio_session: 728ae3823d914f809c485d390274861a24b0904e
  flutter_onnxruntime: e6887abc1032d3e5c92f84b912ad42c33e9ce1c9
  FlutterMacOS: d0db08ddef1a9af05a5ec4b724367152bb0500b1
  just_audio: a42c63806f16995daf5b219ae1d679deb76e6a79
  objective_c: e5f8194456e8fc943e034d1af00510a1bc29c067
  onnxruntime-c: ac65025f01072d25d7d394a2b43ac30d9397b260
  onnxruntime-objc: 5fa03134356d47b642ec85b1023d9907a123d201

PODFILE CHECKSUM: 6b8e7008b8bf73cd361b3ffb8aa3768b71e74409

COCOAPODS: 1.16.2
13
flutter/macos/Runner/AppDelegate.swift
Normal file
@@ -0,0 +1,13 @@
import Cocoa
import FlutterMacOS

@main
class AppDelegate: FlutterAppDelegate {
  override func applicationShouldTerminateAfterLastWindowClosed(_ sender: NSApplication) -> Bool {
    return true
  }

  override func applicationSupportsSecureRestorableState(_ app: NSApplication) -> Bool {
    return true
  }
}
@@ -0,0 +1,68 @@
{
  "images" : [
    {
      "size" : "16x16",
      "idiom" : "mac",
      "filename" : "app_icon_16.png",
      "scale" : "1x"
    },
    {
      "size" : "16x16",
      "idiom" : "mac",
      "filename" : "app_icon_32.png",
      "scale" : "2x"
    },
    {
      "size" : "32x32",
      "idiom" : "mac",
      "filename" : "app_icon_32.png",
      "scale" : "1x"
    },
    {
      "size" : "32x32",
      "idiom" : "mac",
      "filename" : "app_icon_64.png",
      "scale" : "2x"
    },
    {
      "size" : "128x128",
      "idiom" : "mac",
      "filename" : "app_icon_128.png",
      "scale" : "1x"
    },
    {
      "size" : "128x128",
      "idiom" : "mac",
      "filename" : "app_icon_256.png",
      "scale" : "2x"
    },
    {
      "size" : "256x256",
      "idiom" : "mac",
      "filename" : "app_icon_256.png",
      "scale" : "1x"
    },
    {
      "size" : "256x256",
      "idiom" : "mac",
      "filename" : "app_icon_512.png",
      "scale" : "2x"
    },
    {
      "size" : "512x512",
      "idiom" : "mac",
      "filename" : "app_icon_512.png",
      "scale" : "1x"
    },
    {
      "size" : "512x512",
      "idiom" : "mac",
      "filename" : "app_icon_1024.png",
      "scale" : "2x"
    }
  ],
  "info" : {
    "version" : 1,
    "author" : "xcode"
  }
}
343
flutter/macos/Runner/Base.lproj/MainMenu.xib
Normal file
@@ -0,0 +1,343 @@
<?xml version="1.0" encoding="UTF-8"?>
<document type="com.apple.InterfaceBuilder3.Cocoa.XIB" version="3.0" toolsVersion="14490.70" targetRuntime="MacOSX.Cocoa" propertyAccessControl="none" useAutolayout="YES" customObjectInstantitationMethod="direct">
    <dependencies>
        <deployment identifier="macosx"/>
        <plugIn identifier="com.apple.InterfaceBuilder.CocoaPlugin" version="14490.70"/>
        <capability name="documents saved in the Xcode 8 format" minToolsVersion="8.0"/>
    </dependencies>
    <objects>
        <customObject id="-2" userLabel="File's Owner" customClass="NSApplication">
            <connections>
                <outlet property="delegate" destination="Voe-Tx-rLC" id="GzC-gU-4Uq"/>
            </connections>
        </customObject>
        <customObject id="-1" userLabel="First Responder" customClass="FirstResponder"/>
        <customObject id="-3" userLabel="Application" customClass="NSObject"/>
        <customObject id="Voe-Tx-rLC" customClass="AppDelegate" customModule="Runner" customModuleProvider="target">
            <connections>
                <outlet property="applicationMenu" destination="uQy-DD-JDr" id="XBo-yE-nKs"/>
                <outlet property="mainFlutterWindow" destination="QvC-M9-y7g" id="gIp-Ho-8D9"/>
            </connections>
        </customObject>
        <customObject id="YLy-65-1bz" customClass="NSFontManager"/>
        <menu title="Main Menu" systemMenu="main" id="AYu-sK-qS6">
            <items>
                <menuItem title="APP_NAME" id="1Xt-HY-uBw">
                    <modifierMask key="keyEquivalentModifierMask"/>
                    <menu key="submenu" title="APP_NAME" systemMenu="apple" id="uQy-DD-JDr">
                        <items>
                            <menuItem title="About APP_NAME" id="5kV-Vb-QxS">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <connections>
                                    <action selector="orderFrontStandardAboutPanel:" target="-1" id="Exp-CZ-Vem"/>
                                </connections>
                            </menuItem>
                            <menuItem isSeparatorItem="YES" id="VOq-y0-SEH"/>
                            <menuItem title="Preferences…" keyEquivalent="," id="BOF-NM-1cW"/>
                            <menuItem isSeparatorItem="YES" id="wFC-TO-SCJ"/>
                            <menuItem title="Services" id="NMo-om-nkz">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <menu key="submenu" title="Services" systemMenu="services" id="hz9-B4-Xy5"/>
                            </menuItem>
                            <menuItem isSeparatorItem="YES" id="4je-JR-u6R"/>
                            <menuItem title="Hide APP_NAME" keyEquivalent="h" id="Olw-nP-bQN">
                                <connections>
                                    <action selector="hide:" target="-1" id="PnN-Uc-m68"/>
                                </connections>
                            </menuItem>
                            <menuItem title="Hide Others" keyEquivalent="h" id="Vdr-fp-XzO">
                                <modifierMask key="keyEquivalentModifierMask" option="YES" command="YES"/>
                                <connections>
                                    <action selector="hideOtherApplications:" target="-1" id="VT4-aY-XCT"/>
                                </connections>
                            </menuItem>
                            <menuItem title="Show All" id="Kd2-mp-pUS">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <connections>
                                    <action selector="unhideAllApplications:" target="-1" id="Dhg-Le-xox"/>
                                </connections>
                            </menuItem>
                            <menuItem isSeparatorItem="YES" id="kCx-OE-vgT"/>
                            <menuItem title="Quit APP_NAME" keyEquivalent="q" id="4sb-4s-VLi">
                                <connections>
                                    <action selector="terminate:" target="-1" id="Te7-pn-YzF"/>
                                </connections>
                            </menuItem>
                        </items>
                    </menu>
                </menuItem>
                <menuItem title="Edit" id="5QF-Oa-p0T">
                    <modifierMask key="keyEquivalentModifierMask"/>
                    <menu key="submenu" title="Edit" id="W48-6f-4Dl">
                        <items>
                            <menuItem title="Undo" keyEquivalent="z" id="dRJ-4n-Yzg">
                                <connections>
                                    <action selector="undo:" target="-1" id="M6e-cu-g7V"/>
                                </connections>
                            </menuItem>
                            <menuItem title="Redo" keyEquivalent="Z" id="6dh-zS-Vam">
                                <connections>
                                    <action selector="redo:" target="-1" id="oIA-Rs-6OD"/>
                                </connections>
                            </menuItem>
                            <menuItem isSeparatorItem="YES" id="WRV-NI-Exz"/>
                            <menuItem title="Cut" keyEquivalent="x" id="uRl-iY-unG">
                                <connections>
                                    <action selector="cut:" target="-1" id="YJe-68-I9s"/>
                                </connections>
                            </menuItem>
                            <menuItem title="Copy" keyEquivalent="c" id="x3v-GG-iWU">
                                <connections>
                                    <action selector="copy:" target="-1" id="G1f-GL-Joy"/>
                                </connections>
                            </menuItem>
                            <menuItem title="Paste" keyEquivalent="v" id="gVA-U4-sdL">
                                <connections>
                                    <action selector="paste:" target="-1" id="UvS-8e-Qdg"/>
                                </connections>
                            </menuItem>
                            <menuItem title="Paste and Match Style" keyEquivalent="V" id="WeT-3V-zwk">
                                <modifierMask key="keyEquivalentModifierMask" option="YES" command="YES"/>
                                <connections>
                                    <action selector="pasteAsPlainText:" target="-1" id="cEh-KX-wJQ"/>
                                </connections>
                            </menuItem>
                            <menuItem title="Delete" id="pa3-QI-u2k">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <connections>
                                    <action selector="delete:" target="-1" id="0Mk-Ml-PaM"/>
                                </connections>
                            </menuItem>
                            <menuItem title="Select All" keyEquivalent="a" id="Ruw-6m-B2m">
                                <connections>
                                    <action selector="selectAll:" target="-1" id="VNm-Mi-diN"/>
                                </connections>
                            </menuItem>
                            <menuItem isSeparatorItem="YES" id="uyl-h8-XO2"/>
                            <menuItem title="Find" id="4EN-yA-p0u">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <menu key="submenu" title="Find" id="1b7-l0-nxx">
                                    <items>
                                        <menuItem title="Find…" tag="1" keyEquivalent="f" id="Xz5-n4-O0W">
                                            <connections>
                                                <action selector="performFindPanelAction:" target="-1" id="cD7-Qs-BN4"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Find and Replace…" tag="12" keyEquivalent="f" id="YEy-JH-Tfz">
                                            <modifierMask key="keyEquivalentModifierMask" option="YES" command="YES"/>
                                            <connections>
                                                <action selector="performFindPanelAction:" target="-1" id="WD3-Gg-5AJ"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Find Next" tag="2" keyEquivalent="g" id="q09-fT-Sye">
                                            <connections>
                                                <action selector="performFindPanelAction:" target="-1" id="NDo-RZ-v9R"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Find Previous" tag="3" keyEquivalent="G" id="OwM-mh-QMV">
                                            <connections>
                                                <action selector="performFindPanelAction:" target="-1" id="HOh-sY-3ay"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Use Selection for Find" tag="7" keyEquivalent="e" id="buJ-ug-pKt">
                                            <connections>
                                                <action selector="performFindPanelAction:" target="-1" id="U76-nv-p5D"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Jump to Selection" keyEquivalent="j" id="S0p-oC-mLd">
                                            <connections>
                                                <action selector="centerSelectionInVisibleArea:" target="-1" id="IOG-6D-g5B"/>
                                            </connections>
                                        </menuItem>
                                    </items>
                                </menu>
                            </menuItem>
                            <menuItem title="Spelling and Grammar" id="Dv1-io-Yv7">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <menu key="submenu" title="Spelling" id="3IN-sU-3Bg">
                                    <items>
                                        <menuItem title="Show Spelling and Grammar" keyEquivalent=":" id="HFo-cy-zxI">
                                            <connections>
                                                <action selector="showGuessPanel:" target="-1" id="vFj-Ks-hy3"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Check Document Now" keyEquivalent=";" id="hz2-CU-CR7">
                                            <connections>
                                                <action selector="checkSpelling:" target="-1" id="fz7-VC-reM"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem isSeparatorItem="YES" id="bNw-od-mp5"/>
                                        <menuItem title="Check Spelling While Typing" id="rbD-Rh-wIN">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="toggleContinuousSpellChecking:" target="-1" id="7w6-Qz-0kB"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Check Grammar With Spelling" id="mK6-2p-4JG">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="toggleGrammarChecking:" target="-1" id="muD-Qn-j4w"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Correct Spelling Automatically" id="78Y-hA-62v">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="toggleAutomaticSpellingCorrection:" target="-1" id="2lM-Qi-WAP"/>
                                            </connections>
                                        </menuItem>
                                    </items>
                                </menu>
                            </menuItem>
                            <menuItem title="Substitutions" id="9ic-FL-obx">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <menu key="submenu" title="Substitutions" id="FeM-D8-WVr">
                                    <items>
                                        <menuItem title="Show Substitutions" id="z6F-FW-3nz">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="orderFrontSubstitutionsPanel:" target="-1" id="oku-mr-iSq"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem isSeparatorItem="YES" id="gPx-C9-uUO"/>
                                        <menuItem title="Smart Copy/Paste" id="9yt-4B-nSM">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="toggleSmartInsertDelete:" target="-1" id="3IJ-Se-DZD"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Smart Quotes" id="hQb-2v-fYv">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="toggleAutomaticQuoteSubstitution:" target="-1" id="ptq-xd-QOA"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Smart Dashes" id="rgM-f4-ycn">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="toggleAutomaticDashSubstitution:" target="-1" id="oCt-pO-9gS"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Smart Links" id="cwL-P1-jid">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="toggleAutomaticLinkDetection:" target="-1" id="Gip-E3-Fov"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Data Detectors" id="tRr-pd-1PS">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="toggleAutomaticDataDetection:" target="-1" id="R1I-Nq-Kbl"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Text Replacement" id="HFQ-gK-NFA">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="toggleAutomaticTextReplacement:" target="-1" id="DvP-Fe-Py6"/>
                                            </connections>
                                        </menuItem>
                                    </items>
                                </menu>
                            </menuItem>
                            <menuItem title="Transformations" id="2oI-Rn-ZJC">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <menu key="submenu" title="Transformations" id="c8a-y6-VQd">
                                    <items>
                                        <menuItem title="Make Upper Case" id="vmV-6d-7jI">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="uppercaseWord:" target="-1" id="sPh-Tk-edu"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Make Lower Case" id="d9M-CD-aMd">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="lowercaseWord:" target="-1" id="iUZ-b5-hil"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Capitalize" id="UEZ-Bs-lqG">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="capitalizeWord:" target="-1" id="26H-TL-nsh"/>
                                            </connections>
                                        </menuItem>
                                    </items>
                                </menu>
                            </menuItem>
                            <menuItem title="Speech" id="xrE-MZ-jX0">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <menu key="submenu" title="Speech" id="3rS-ZA-NoH">
                                    <items>
                                        <menuItem title="Start Speaking" id="Ynk-f8-cLZ">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="startSpeaking:" target="-1" id="654-Ng-kyl"/>
                                            </connections>
                                        </menuItem>
                                        <menuItem title="Stop Speaking" id="Oyz-dy-DGm">
                                            <modifierMask key="keyEquivalentModifierMask"/>
                                            <connections>
                                                <action selector="stopSpeaking:" target="-1" id="dX8-6p-jy9"/>
                                            </connections>
                                        </menuItem>
                                    </items>
                                </menu>
                            </menuItem>
                        </items>
                    </menu>
                </menuItem>
                <menuItem title="View" id="H8h-7b-M4v">
                    <modifierMask key="keyEquivalentModifierMask"/>
                    <menu key="submenu" title="View" id="HyV-fh-RgO">
                        <items>
                            <menuItem title="Enter Full Screen" keyEquivalent="f" id="4J7-dP-txa">
                                <modifierMask key="keyEquivalentModifierMask" control="YES" command="YES"/>
                                <connections>
                                    <action selector="toggleFullScreen:" target="-1" id="dU3-MA-1Rq"/>
                                </connections>
                            </menuItem>
                        </items>
                    </menu>
                </menuItem>
                <menuItem title="Window" id="aUF-d1-5bR">
                    <modifierMask key="keyEquivalentModifierMask"/>
                    <menu key="submenu" title="Window" systemMenu="window" id="Td7-aD-5lo">
                        <items>
                            <menuItem title="Minimize" keyEquivalent="m" id="OY7-WF-poV">
                                <connections>
                                    <action selector="performMiniaturize:" target="-1" id="VwT-WD-YPe"/>
                                </connections>
                            </menuItem>
                            <menuItem title="Zoom" id="R4o-n2-Eq4">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <connections>
                                    <action selector="performZoom:" target="-1" id="DIl-cC-cCs"/>
                                </connections>
                            </menuItem>
                            <menuItem isSeparatorItem="YES" id="eu3-7i-yIM"/>
                            <menuItem title="Bring All to Front" id="LE2-aR-0XJ">
                                <modifierMask key="keyEquivalentModifierMask"/>
                                <connections>
                                    <action selector="arrangeInFront:" target="-1" id="DRN-fu-gQh"/>
                                </connections>
                            </menuItem>
                        </items>
                    </menu>
                </menuItem>
                <menuItem title="Help" id="EPT-qC-fAb">
                    <modifierMask key="keyEquivalentModifierMask"/>
                    <menu key="submenu" title="Help" systemMenu="help" id="rJ0-wn-3NY"/>
                </menuItem>
            </items>
            <point key="canvasLocation" x="142" y="-258"/>
        </menu>
        <window title="APP_NAME" allowsToolTipsWhenApplicationIsInactive="NO" autorecalculatesKeyViewLoop="NO" releasedWhenClosed="NO" animationBehavior="default" id="QvC-M9-y7g" customClass="MainFlutterWindow" customModule="Runner" customModuleProvider="target">
            <windowStyleMask key="styleMask" titled="YES" closable="YES" miniaturizable="YES" resizable="YES"/>
            <rect key="contentRect" x="335" y="390" width="800" height="600"/>
            <rect key="screenRect" x="0.0" y="0.0" width="2560" height="1577"/>
            <view key="contentView" wantsLayer="YES" id="EiT-Mj-1SZ">
                <rect key="frame" x="0.0" y="0.0" width="800" height="600"/>
                <autoresizingMask key="autoresizingMask"/>
            </view>
        </window>
    </objects>
</document>
14
flutter/macos/Runner/Configs/AppInfo.xcconfig
Normal file
@@ -0,0 +1,14 @@
// Application-level settings for the Runner target.
//
// This may be replaced with something auto-generated from metadata (e.g., pubspec.yaml) in the
// future. If not, the values below would default to using the project name when this becomes a
// 'flutter create' template.

// The application's name. By default this is also the title of the Flutter window.
PRODUCT_NAME = flutter_sdk

// The application's bundle identifier
PRODUCT_BUNDLE_IDENTIFIER = com.example.flutterSdk

// The copyright displayed in application information
PRODUCT_COPYRIGHT = Copyright © 2025 com.example. All rights reserved.
2
flutter/macos/Runner/Configs/Debug.xcconfig
Normal file
@@ -0,0 +1,2 @@
#include "../../Flutter/Flutter-Debug.xcconfig"
#include "Warnings.xcconfig"
2
flutter/macos/Runner/Configs/Release.xcconfig
Normal file
@@ -0,0 +1,2 @@
#include "../../Flutter/Flutter-Release.xcconfig"
#include "Warnings.xcconfig"
13
flutter/macos/Runner/Configs/Warnings.xcconfig
Normal file
@@ -0,0 +1,13 @@
WARNING_CFLAGS = -Wall -Wconditional-uninitialized -Wnullable-to-nonnull-conversion -Wmissing-method-return-type -Woverlength-strings
GCC_WARN_UNDECLARED_SELECTOR = YES
CLANG_UNDEFINED_BEHAVIOR_SANITIZER_NULLABILITY = YES
CLANG_WARN_UNGUARDED_AVAILABILITY = YES_AGGRESSIVE
CLANG_WARN__DUPLICATE_METHOD_MATCH = YES
CLANG_WARN_PRAGMA_PACK = YES
CLANG_WARN_STRICT_PROTOTYPES = YES
CLANG_WARN_COMMA = YES
GCC_WARN_STRICT_SELECTOR_MATCH = YES
CLANG_WARN_OBJC_REPEATED_USE_OF_WEAK = YES
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES
GCC_WARN_SHADOW = YES
CLANG_WARN_UNREACHABLE_CODE = YES
12
flutter/macos/Runner/DebugProfile.entitlements
Normal file
@@ -0,0 +1,12 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>com.apple.security.app-sandbox</key>
	<true/>
	<key>com.apple.security.cs.allow-jit</key>
	<true/>
	<key>com.apple.security.network.server</key>
	<true/>
</dict>
</plist>
32
flutter/macos/Runner/Info.plist
Normal file
@@ -0,0 +1,32 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>CFBundleDevelopmentRegion</key>
	<string>$(DEVELOPMENT_LANGUAGE)</string>
	<key>CFBundleExecutable</key>
	<string>$(EXECUTABLE_NAME)</string>
	<key>CFBundleIconFile</key>
	<string></string>
	<key>CFBundleIdentifier</key>
	<string>$(PRODUCT_BUNDLE_IDENTIFIER)</string>
	<key>CFBundleInfoDictionaryVersion</key>
	<string>6.0</string>
	<key>CFBundleName</key>
	<string>$(PRODUCT_NAME)</string>
	<key>CFBundlePackageType</key>
	<string>APPL</string>
	<key>CFBundleShortVersionString</key>
	<string>$(FLUTTER_BUILD_NAME)</string>
	<key>CFBundleVersion</key>
	<string>$(FLUTTER_BUILD_NUMBER)</string>
	<key>LSMinimumSystemVersion</key>
	<string>$(MACOSX_DEPLOYMENT_TARGET)</string>
	<key>NSHumanReadableCopyright</key>
	<string>$(PRODUCT_COPYRIGHT)</string>
	<key>NSMainNibFile</key>
	<string>MainMenu</string>
	<key>NSPrincipalClass</key>
	<string>NSApplication</string>
</dict>
</plist>
15
flutter/macos/Runner/MainFlutterWindow.swift
Normal file
@@ -0,0 +1,15 @@
import Cocoa
import FlutterMacOS

class MainFlutterWindow: NSWindow {
  override func awakeFromNib() {
    let flutterViewController = FlutterViewController()
    let windowFrame = self.frame
    self.contentViewController = flutterViewController
    self.setFrame(windowFrame, display: true)

    RegisterGeneratedPlugins(registry: flutterViewController)

    super.awakeFromNib()
  }
}
8
flutter/macos/Runner/Release.entitlements
Normal file
@@ -0,0 +1,8 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>com.apple.security.app-sandbox</key>
	<true/>
</dict>
</plist>
12
flutter/macos/RunnerTests/RunnerTests.swift
Normal file
@@ -0,0 +1,12 @@
import Cocoa
import FlutterMacOS
import XCTest

class RunnerTests: XCTestCase {

  func testExample() {
    // If you add code to the Runner application, consider adding tests here.
    // See https://developer.apple.com/documentation/xctest for more information about using XCTest.
  }

}
418
flutter/pubspec.lock
Normal file
@@ -0,0 +1,418 @@
# Generated by pub
# See https://dart.dev/tools/pub/glossary#lockfile
packages:
  args:
    dependency: "direct main"
    description:
      name: args
      sha256: d0481093c50b1da8910eb0bb301626d4d8eb7284aa739614d2b394ee09e3ea04
      url: "https://pub.dev"
    source: hosted
    version: "2.7.0"
  async:
    dependency: transitive
    description:
      name: async
      sha256: "758e6d74e971c3e5aceb4110bfd6698efc7f501675bcfe0c775459a8140750eb"
      url: "https://pub.dev"
    source: hosted
    version: "2.13.0"
  audio_session:
    dependency: transitive
    description:
      name: audio_session
      sha256: "8f96a7fecbb718cb093070f868b4cdcb8a9b1053dce342ff8ab2fde10eb9afb7"
      url: "https://pub.dev"
    source: hosted
    version: "0.2.2"
  boolean_selector:
    dependency: transitive
    description:
      name: boolean_selector
      sha256: "8aab1771e1243a5063b8b0ff68042d67334e3feab9e95b9490f9a6ebf73b42ea"
      url: "https://pub.dev"
    source: hosted
    version: "2.1.2"
  characters:
    dependency: transitive
    description:
      name: characters
      sha256: f71061c654a3380576a52b451dd5532377954cf9dbd272a78fc8479606670803
      url: "https://pub.dev"
    source: hosted
    version: "1.4.0"
  clock:
    dependency: transitive
    description:
      name: clock
      sha256: fddb70d9b5277016c77a80201021d40a2247104d9f4aa7bab7157b7e3f05b84b
      url: "https://pub.dev"
    source: hosted
    version: "1.1.2"
  collection:
    dependency: transitive
    description:
      name: collection
      sha256: "2f5709ae4d3d59dd8f7cd309b4e023046b57d8a6c82130785d2b0e5868084e76"
      url: "https://pub.dev"
    source: hosted
    version: "1.19.1"
  crypto:
    dependency: transitive
    description:
      name: crypto
      sha256: c8ea0233063ba03258fbcf2ca4d6dadfefe14f02fab57702265467a19f27fadf
      url: "https://pub.dev"
    source: hosted
    version: "3.0.7"
  fake_async:
    dependency: transitive
    description:
      name: fake_async
      sha256: "5368f224a74523e8d2e7399ea1638b37aecfca824a3cc4dfdf77bf1fa905ac44"
      url: "https://pub.dev"
    source: hosted
    version: "1.3.3"
  ffi:
    dependency: transitive
    description:
      name: ffi
      sha256: "289279317b4b16eb2bb7e271abccd4bf84ec9bdcbe999e278a94b804f5630418"
      url: "https://pub.dev"
    source: hosted
    version: "2.1.4"
  fixnum:
    dependency: transitive
    description:
      name: fixnum
      sha256: b6dc7065e46c974bc7c5f143080a6764ec7a4be6da1285ececdc37be96de53be
      url: "https://pub.dev"
    source: hosted
    version: "1.1.1"
  flutter:
    dependency: "direct main"
    description: flutter
    source: sdk
    version: "0.0.0"
  flutter_lints:
    dependency: "direct dev"
    description:
      name: flutter_lints
      sha256: "5398f14efa795ffb7a33e9b6a08798b26a180edac4ad7db3f231e40f82ce11e1"
      url: "https://pub.dev"
    source: hosted
    version: "5.0.0"
  flutter_onnxruntime:
    dependency: "direct main"
    description:
      name: flutter_onnxruntime
      sha256: "55842e69293ec52c07f3065049ff7641e94e8a6cca3f659a913d5401a3994424"
      url: "https://pub.dev"
    source: hosted
    version: "1.6.0"
  flutter_test:
    dependency: "direct dev"
    description: flutter
    source: sdk
    version: "0.0.0"
  flutter_web_plugins:
    dependency: transitive
    description: flutter
    source: sdk
    version: "0.0.0"
  just_audio:
    dependency: "direct main"
    description:
      name: just_audio
      sha256: "9694e4734f515f2a052493d1d7e0d6de219ee0427c7c29492e246ff32a219908"
      url: "https://pub.dev"
    source: hosted
    version: "0.10.5"
  just_audio_platform_interface:
    dependency: transitive
    description:
      name: just_audio_platform_interface
      sha256: "2532c8d6702528824445921c5ff10548b518b13f808c2e34c2fd54793b999a6a"
      url: "https://pub.dev"
    source: hosted
    version: "4.6.0"
  just_audio_web:
    dependency: transitive
    description:
      name: just_audio_web
      sha256: "6ba8a2a7e87d57d32f0f7b42856ade3d6a9fbe0f1a11fabae0a4f00bb73f0663"
      url: "https://pub.dev"
    source: hosted
    version: "0.4.16"
  leak_tracker:
    dependency: transitive
    description:
      name: leak_tracker
      sha256: "33e2e26bdd85a0112ec15400c8cbffea70d0f9c3407491f672a2fad47915e2de"
      url: "https://pub.dev"
    source: hosted
    version: "11.0.2"
  leak_tracker_flutter_testing:
    dependency: transitive
    description:
      name: leak_tracker_flutter_testing
      sha256: "1dbc140bb5a23c75ea9c4811222756104fbcd1a27173f0c34ca01e16bea473c1"
      url: "https://pub.dev"
    source: hosted
    version: "3.0.10"
  leak_tracker_testing:
    dependency: transitive
    description:
      name: leak_tracker_testing
      sha256: "8d5a2d49f4a66b49744b23b018848400d23e54caf9463f4eb20df3eb8acb2eb1"
      url: "https://pub.dev"
    source: hosted
    version: "3.0.2"
  lints:
    dependency: transitive
    description:
      name: lints
      sha256: c35bb79562d980e9a453fc715854e1ed39e24e7d0297a880ef54e17f9874a9d7
      url: "https://pub.dev"
    source: hosted
    version: "5.1.1"
  logger:
    dependency: "direct main"
    description:
      name: logger
      sha256: a7967e31b703831a893bbc3c3dd11db08126fe5f369b5c648a36f821979f5be3
      url: "https://pub.dev"
    source: hosted
    version: "2.6.2"
  matcher:
    dependency: transitive
    description:
      name: matcher
      sha256: dc58c723c3c24bf8d3e2d3ad3f2f9d7bd9cf43ec6feaa64181775e60190153f2
      url: "https://pub.dev"
    source: hosted
    version: "0.12.17"
  material_color_utilities:
    dependency: transitive
    description:
      name: material_color_utilities
      sha256: f7142bb1154231d7ea5f96bc7bde4bda2a0945d2806bb11670e30b850d56bdec
      url: "https://pub.dev"
    source: hosted
    version: "0.11.1"
  meta:
    dependency: transitive
    description:
      name: meta
      sha256: "23f08335362185a5ea2ad3a4e597f1375e78bce8a040df5c600c8d3552ef2394"
      url: "https://pub.dev"
    source: hosted
    version: "1.17.0"
  objective_c:
    dependency: transitive
    description:
      name: objective_c
      sha256: "1f81ed9e41909d44162d7ec8663b2c647c202317cc0b56d3d56f6a13146a0b64"
      url: "https://pub.dev"
    source: hosted
    version: "9.1.0"
  path:
    dependency: transitive
    description:
      name: path
      sha256: "75cca69d1490965be98c73ceaea117e8a04dd21217b37b292c9ddbec0d955bc5"
      url: "https://pub.dev"
    source: hosted
    version: "1.9.1"
  path_provider:
    dependency: "direct main"
    description:
      name: path_provider
      sha256: "50c5dd5b6e1aaf6fb3a78b33f6aa3afca52bf903a8a5298f53101fdaee55bbcd"
      url: "https://pub.dev"
    source: hosted
    version: "2.1.5"
  path_provider_android:
    dependency: transitive
    description:
      name: path_provider_android
      sha256: f2c65e21139ce2c3dad46922be8272bb5963516045659e71bb16e151c93b580e
      url: "https://pub.dev"
    source: hosted
    version: "2.2.22"
  path_provider_foundation:
    dependency: transitive
    description:
      name: path_provider_foundation
      sha256: "6192e477f34018ef1ea790c56fffc7302e3bc3efede9e798b934c252c8c105ba"
      url: "https://pub.dev"
    source: hosted
    version: "2.5.0"
  path_provider_linux:
    dependency: transitive
    description:
      name: path_provider_linux
      sha256: f7a1fe3a634fe7734c8d3f2766ad746ae2a2884abe22e241a8b301bf5cac3279
      url: "https://pub.dev"
    source: hosted
    version: "2.2.1"
  path_provider_platform_interface:
    dependency: transitive
    description:
      name: path_provider_platform_interface
      sha256: "88f5779f72ba699763fa3a3b06aa4bf6de76c8e5de842cf6f29e2e06476c2334"
      url: "https://pub.dev"
    source: hosted
    version: "2.1.2"
  path_provider_windows:
    dependency: transitive
    description:
      name: path_provider_windows
      sha256: bd6f00dbd873bfb70d0761682da2b3a2c2fccc2b9e84c495821639601d81afe7
      url: "https://pub.dev"
    source: hosted
    version: "2.3.0"
  platform:
    dependency: transitive
    description:
      name: platform
      sha256: "5d6b1b0036a5f331ebc77c850ebc8506cbc1e9416c27e59b439f917a902a4984"
      url: "https://pub.dev"
    source: hosted
    version: "3.1.6"
  plugin_platform_interface:
    dependency: transitive
    description:
      name: plugin_platform_interface
      sha256: "4820fbfdb9478b1ebae27888254d445073732dae3d6ea81f0b7e06d5dedc3f02"
      url: "https://pub.dev"
    source: hosted
    version: "2.1.8"
  pub_semver:
    dependency: transitive
    description:
      name: pub_semver
      sha256: "5bfcf68ca79ef689f8990d1160781b4bad40a3bd5e5218ad4076ddb7f4081585"
      url: "https://pub.dev"
    source: hosted
    version: "2.2.0"
  rxdart:
    dependency: transitive
    description:
      name: rxdart
      sha256: "5c3004a4a8dbb94bd4bf5412a4def4acdaa12e12f269737a5751369e12d1a962"
      url: "https://pub.dev"
    source: hosted
    version: "0.28.0"
  sky_engine:
    dependency: transitive
    description: flutter
    source: sdk
    version: "0.0.0"
  source_span:
    dependency: transitive
    description:
      name: source_span
      sha256: "254ee5351d6cb365c859e20ee823c3bb479bf4a293c22d17a9f1bf144ce86f7c"
      url: "https://pub.dev"
    source: hosted
    version: "1.10.1"
  stack_trace:
    dependency: transitive
    description:
      name: stack_trace
      sha256: "8b27215b45d22309b5cddda1aa2b19bdfec9df0e765f2de506401c071d38d1b1"
      url: "https://pub.dev"
    source: hosted
    version: "1.12.1"
  stream_channel:
    dependency: transitive
    description:
      name: stream_channel
      sha256: "969e04c80b8bcdf826f8f16579c7b14d780458bd97f56d107d3950fdbeef059d"
      url: "https://pub.dev"
    source: hosted
    version: "2.1.4"
  string_scanner:
    dependency: transitive
    description:
      name: string_scanner
      sha256: "921cd31725b72fe181906c6a94d987c78e3b98c2e205b397ea399d4054872b43"
      url: "https://pub.dev"
    source: hosted
    version: "1.4.1"
  synchronized:
    dependency: transitive
    description:
      name: synchronized
      sha256: c254ade258ec8282947a0acbbc90b9575b4f19673533ee46f2f6e9b3aeefd7c0
      url: "https://pub.dev"
    source: hosted
    version: "3.4.0"
  term_glyph:
    dependency: transitive
    description:
      name: term_glyph
      sha256: "7f554798625ea768a7518313e58f83891c7f5024f88e46e7182a4558850a4b8e"
      url: "https://pub.dev"
    source: hosted
    version: "1.2.2"
  test_api:
    dependency: transitive
    description:
      name: test_api
      sha256: ab2726c1a94d3176a45960b6234466ec367179b87dd74f1611adb1f3b5fb9d55
      url: "https://pub.dev"
    source: hosted
    version: "0.7.7"
  typed_data:
    dependency: transitive
    description:
      name: typed_data
      sha256: f9049c039ebfeb4cf7a7104a675823cd72dba8297f264b6637062516699fa006
      url: "https://pub.dev"
    source: hosted
    version: "1.4.0"
  uuid:
    dependency: transitive
    description:
      name: uuid
      sha256: a11b666489b1954e01d992f3d601b1804a33937b5a8fe677bd26b8a9f96f96e8
      url: "https://pub.dev"
    source: hosted
    version: "4.5.2"
  vector_math:
    dependency: transitive
    description:
      name: vector_math
      sha256: d530bd74fea330e6e364cda7a85019c434070188383e1cd8d9777ee586914c5b
      url: "https://pub.dev"
    source: hosted
    version: "2.2.0"
  vm_service:
    dependency: transitive
    description:
      name: vm_service
      sha256: "45caa6c5917fa127b5dbcfbd1fa60b14e583afdc08bfc96dda38886ca252eb60"
      url: "https://pub.dev"
    source: hosted
    version: "15.0.2"
  web:
    dependency: transitive
    description:
      name: web
      sha256: "868d88a33d8a87b18ffc05f9f030ba328ffefba92d6c127917a2ba740f9cfe4a"
      url: "https://pub.dev"
    source: hosted
    version: "1.1.1"
  xdg_directories:
    dependency: transitive
    description:
      name: xdg_directories
      sha256: "7a3f37b05d989967cdddcbb571f1ea834867ae2faa29725fd085180e0883aa15"
      url: "https://pub.dev"
    source: hosted
    version: "1.1.0"
sdks:
  dart: ">=3.9.0 <4.0.0"
  flutter: ">=3.35.0"
26
flutter/pubspec.yaml
Normal file
@@ -0,0 +1,26 @@
name: flutter_sdk
description: Supertonic Flutter SDK TTS Example
version: 1.0.0

environment:
  sdk: ^3.5.0

dependencies:
  flutter:
    sdk: flutter
  flutter_onnxruntime: ^1.6.0
  args: ^2.4.0
  path_provider: ^2.1.1
  just_audio: ^0.10.5
  logger: ^2.0.2

dev_dependencies:
  flutter_test:
    sdk: flutter
  flutter_lints: ^5.0.0

flutter:
  assets:
    - assets/onnx/
    - assets/voice_styles/
  uses-material-design: true
17
go/.gitignore
vendored
Normal file
@@ -0,0 +1,17 @@
# Binaries
tts_example
example_onnx
*.exe

# Go build artifacts
*.o
*.a
*.so

# Results
results/

# Go workspace
go.work
go.work.sum
165
go/README.md
Normal file
@@ -0,0 +1,165 @@
# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using `example_onnx.go`.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) are now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic).

**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.

**2025.11.19** - Added the `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).

**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.

## Installation

This project uses Go modules for dependency management.

### Prerequisites

1. Install Go 1.21 or later from [https://golang.org/dl/](https://golang.org/dl/)
2. Install the ONNX Runtime C library:

**macOS (via Homebrew):**
```bash
brew install onnxruntime
```

**Linux:**
```bash
# Download ONNX Runtime from the GitHub releases page
wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.0/onnxruntime-linux-x64-1.16.0.tgz
tar -xzf onnxruntime-linux-x64-1.16.0.tgz
sudo cp onnxruntime-linux-x64-1.16.0/lib/* /usr/local/lib/
sudo cp -r onnxruntime-linux-x64-1.16.0/include/* /usr/local/include/
sudo ldconfig
```

### Install Go dependencies

```bash
go mod download
```

### Configure ONNX Runtime Library Path (Optional)

If the ONNX Runtime library is not in a standard location, set the `ONNXRUNTIME_LIB_PATH` environment variable:

**Automatic Detection (Recommended):**

```bash
# macOS
export ONNXRUNTIME_LIB_PATH=$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib

# Linux
export ONNXRUNTIME_LIB_PATH=$(find /usr/local/lib /usr/lib -name "libonnxruntime.so*" 2>/dev/null | head -n 1)
```

**Manual Configuration:**

```bash
export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.so    # Linux
# or
export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.dylib # macOS
```

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
go run example_onnx.go helper.go
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4

### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
go run example_onnx.go helper.go \
    --batch \
    -voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
    -text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
    -lang "en,ko"
```

This will:
- Generate speech for 2 different voice-text-language pairs
- Use the male voice (M1.json) for the first text, in English
- Use the female voice (F1.json) for the second text, in Korean
- Process both samples in a single batch
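The comma-separated `-voice-style`/`-lang` values and the pipe-separated `-text` values are split and paired index-by-index, as `parseArgs` in `example_onnx.go` does. A trimmed-down sketch of that pairing logic:

```go
package main

import (
	"fmt"
	"strings"
)

// splitTrim mirrors how parseArgs in example_onnx.go turns the
// comma/pipe separated flag values into per-sample lists.
func splitTrim(s, sep string) []string {
	parts := strings.Split(s, sep)
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts
}

func main() {
	styles := splitTrim("assets/voice_styles/M1.json, assets/voice_styles/F1.json", ",")
	texts := splitTrim("First sample text.|Second sample text.", "|")
	langs := splitTrim("en, ko", ",")

	// In batch mode all three lists must have the same length;
	// index i forms one (style, text, lang) sample.
	if len(styles) != len(texts) || len(langs) != len(texts) {
		panic("voice-style, text, and lang counts must match in batch mode")
	}
	for i := range texts {
		fmt.Printf("sample %d: style=%s lang=%s text=%q\n", i, styles[i], langs[i], texts[i])
	}
}
```

Texts use `|` as the separator (rather than a comma) because sentences routinely contain commas.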
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
go run example_onnx.go helper.go \
    -total-step 10 \
    -voice-style "assets/voice_styles/M1.json" \
    -text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher-quality output at the cost of slower inference

### Example 4: Long-Form Inference
The system automatically splits long texts into manageable chunks, synthesizes each chunk separately, and concatenates the results with natural pauses (0.3 seconds by default) into a single audio file. This happens by default whenever the `--batch` flag is not used:

```bash
go run example_onnx.go helper.go \
    -voice-style "assets/voice_styles/M1.json" \
    -text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
```

This will:
- Automatically split the text into chunks based on paragraph and sentence boundaries
- Synthesize each chunk separately
- Add 0.3 seconds of silence between chunks for natural pauses
- Concatenate all chunks into a single audio file

**Note**: Automatic text chunking is disabled when using `--batch` mode. In batch mode, each text is processed as-is without chunking.
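The actual chunking and concatenation live in `helper.go` (not reproduced here). As a rough illustration of the idea — sentence-boundary splitting plus a fixed-length silence between chunks — where the function names and the simplistic ". "-based splitting rule are purely illustrative, not the real implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// chunkText greedily packs sentences into chunks of at most maxLen bytes.
// Illustrative only; the real helper.go logic also respects paragraphs.
func chunkText(text string, maxLen int) []string {
	var chunks []string
	var cur strings.Builder
	for _, sentence := range strings.SplitAfter(text, ". ") {
		if cur.Len() > 0 && cur.Len()+len(sentence) > maxLen {
			chunks = append(chunks, strings.TrimSpace(cur.String()))
			cur.Reset()
		}
		cur.WriteString(sentence)
	}
	if cur.Len() > 0 {
		chunks = append(chunks, strings.TrimSpace(cur.String()))
	}
	return chunks
}

// joinWithSilence concatenates per-chunk audio with a fixed pause
// between chunks (0.3 s by default, matching the README).
func joinWithSilence(chunks [][]float32, sampleRate int, pauseSec float32) []float32 {
	silence := make([]float32, int(float32(sampleRate)*pauseSec))
	var out []float32
	for i, c := range chunks {
		if i > 0 {
			out = append(out, silence...)
		}
		out = append(out, c...)
	}
	return out
}

func main() {
	text := "First sentence. Second sentence. Third sentence."
	for _, c := range chunkText(text, 20) {
		fmt.Printf("chunk: %q\n", c)
	}
	// Two 100-sample chunks with a 0.3 s pause at 44.1 kHz:
	wav := joinWithSilence([][]float32{make([]float32, 100), make([]float32, 100)}, 44100, 0.3)
	fmt.Println("samples:", len(wav)) // 100 + 13230 + 100 = 13430
}
```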
## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `-use-gpu` | flag | false | Use GPU for inference (default: CPU) |
| `-onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `-total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `-speed` | float | 1.05 | Speech speed factor (higher = faster; recommended range: 0.9-1.5) |
| `-n-test` | int | 4 | Number of times to generate each sample |
| `-voice-style` | str | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `-text` | str | (long default text) | Text(s) to synthesize, pipe-separated |
| `-lang` | str | `en` | Language(s) for synthesis, comma-separated (en, ko, es, pt, fr) |
| `-save-dir` | str | `results` | Output directory |
| `--batch` | flag | false | Enable batch mode (multiple text-style pairs, disables automatic chunking) |

## Notes

- **Multilingual Support**: Use `-lang` to specify the language for each text. Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Batch Processing**: When using `--batch`, the number of `-voice-style`, `-text`, and `-lang` entries must match
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3 s pauses
- **Quality vs. Speed**: Higher `-total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet

## Building a Binary

To build a standalone executable:
```bash
go build -o tts_example example_onnx.go helper.go
```

Then run it:
```bash
./tts_example -voice-style "../assets/voice_styles/M1.json" -text "Hello world"
```
193
go/example_onnx.go
Normal file
@@ -0,0 +1,193 @@
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
	"strings"

	ort "github.com/yalue/onnxruntime_go"
)

// Args holds command line arguments
type Args struct {
	useGPU     bool
	onnxDir    string
	totalStep  int
	speed      float64
	nTest      int
	voiceStyle []string
	text       []string
	lang       []string
	saveDir    string
	batch      bool
}

func parseArgs() *Args {
	args := &Args{}

	flag.BoolVar(&args.useGPU, "use-gpu", false, "Use GPU for inference (default: CPU)")
	flag.StringVar(&args.onnxDir, "onnx-dir", "assets/onnx", "Path to ONNX model directory")
	flag.IntVar(&args.totalStep, "total-step", 5, "Number of denoising steps")
	flag.Float64Var(&args.speed, "speed", 1.05, "Speech speed factor (higher = faster)")
	flag.IntVar(&args.nTest, "n-test", 4, "Number of times to generate")
	flag.StringVar(&args.saveDir, "save-dir", "results", "Output directory")
	flag.BoolVar(&args.batch, "batch", false, "Enable batch mode (multiple text-style pairs)")

	var voiceStyleStr, textStr, langStr string
	flag.StringVar(&voiceStyleStr, "voice-style", "assets/voice_styles/M1.json", "Voice style file path(s), comma-separated")
	flag.StringVar(&textStr, "text", "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.", "Text(s) to synthesize, pipe-separated")
	flag.StringVar(&langStr, "lang", "en", "Language(s) for synthesis, comma-separated (en, ko, es, pt, fr)")

	flag.Parse()

	// Parse comma-separated voice-style
	if voiceStyleStr != "" {
		args.voiceStyle = strings.Split(voiceStyleStr, ",")
		for i := range args.voiceStyle {
			args.voiceStyle[i] = strings.TrimSpace(args.voiceStyle[i])
		}
	}

	// Parse pipe-separated text
	if textStr != "" {
		args.text = strings.Split(textStr, "|")
		for i := range args.text {
			args.text[i] = strings.TrimSpace(args.text[i])
		}
	}

	// Parse comma-separated lang
	if langStr != "" {
		args.lang = strings.Split(langStr, ",")
		for i := range args.lang {
			args.lang[i] = strings.TrimSpace(args.lang[i])
		}
	}

	return args
}

func main() {
	fmt.Print("=== TTS Inference with ONNX Runtime (Go) ===\n\n")

	// --- 1. Parse arguments --- //
	args := parseArgs()
	totalStep := args.totalStep
	speed := float32(args.speed)
	nTest := args.nTest
	saveDir := args.saveDir
	voiceStylePaths := args.voiceStyle
	textList := args.text
	langList := args.lang
	batch := args.batch

	if batch {
		if len(voiceStylePaths) != len(textList) {
			fmt.Printf("Error: Number of voice styles (%d) must match number of texts (%d)\n",
				len(voiceStylePaths), len(textList))
			os.Exit(1)
		}
		if len(langList) != len(textList) {
			fmt.Printf("Error: Number of languages (%d) must match number of texts (%d)\n",
				len(langList), len(textList))
			os.Exit(1)
		}
	}

	bsz := len(voiceStylePaths)

	// Initialize ONNX Runtime
	if err := InitializeONNXRuntime(); err != nil {
		fmt.Printf("Error initializing ONNX Runtime: %v\n", err)
		os.Exit(1)
	}
	defer ort.DestroyEnvironment()

	// --- 2. Load config --- //
	cfg, err := LoadCfgs(args.onnxDir)
	if err != nil {
		fmt.Printf("Error loading config: %v\n", err)
		os.Exit(1)
	}

	// --- 3. Load TTS components --- //
	textToSpeech, err := LoadTextToSpeech(args.onnxDir, args.useGPU, cfg)
	if err != nil {
		fmt.Printf("Error loading TTS components: %v\n", err)
		os.Exit(1)
	}
	defer textToSpeech.Destroy()

	// --- 4. Load voice styles --- //
	style, err := LoadVoiceStyle(voiceStylePaths, true)
	if err != nil {
		fmt.Printf("Error loading voice styles: %v\n", err)
		os.Exit(1)
	}
	defer style.Destroy()

	// --- 5. Synthesize speech --- //
	if err := os.MkdirAll(saveDir, 0755); err != nil {
		fmt.Printf("Error creating save directory: %v\n", err)
		os.Exit(1)
	}

	for n := 0; n < nTest; n++ {
		fmt.Printf("\n[%d/%d] Starting synthesis...\n", n+1, nTest)

		var wav []float32
		var duration []float32

		if batch {
			Timer("Generating speech from text", func() interface{} {
				w, d, err := textToSpeech.Batch(textList, langList, style, totalStep, speed)
				if err != nil {
					fmt.Printf("Error generating speech: %v\n", err)
					os.Exit(1)
				}
				wav = w
				duration = d
				return nil
			})
		} else {
			Timer("Generating speech from text", func() interface{} {
				w, d, err := textToSpeech.Call(textList[0], langList[0], style, totalStep, speed, 0.3)
				if err != nil {
					fmt.Printf("Error generating speech: %v\n", err)
					os.Exit(1)
				}
				wav = w
				duration = []float32{d}
				return nil
			})
		}

		// Save outputs
		for i := 0; i < bsz; i++ {
			fname := fmt.Sprintf("%s_%d.wav", sanitizeFilename(textList[i], 20), n+1)
			var wavOut []float64

			if batch {
				wavOut = extractWavSegment(wav, duration[i], textToSpeech.SampleRate, i, bsz)
			} else {
				// For non-batch mode, wav is a single concatenated audio
				wavLen := int(float32(textToSpeech.SampleRate) * duration[0])
				wavOut = make([]float64, wavLen)
				for j := 0; j < wavLen && j < len(wav); j++ {
					wavOut[j] = float64(wav[j])
				}
			}

			outputPath := filepath.Join(saveDir, fname)
			if err := writeWavFile(outputPath, wavOut, textToSpeech.SampleRate); err != nil {
				fmt.Printf("Error writing wav file: %v\n", err)
				continue
			}
			fmt.Printf("Saved: %s\n", outputPath)
		}
	}

	fmt.Println("\n=== Synthesis completed successfully! ===")
}
13
go/go.mod
Normal file
@@ -0,0 +1,13 @@
module supertonic-tts

go 1.21

require (
	github.com/go-audio/audio v1.0.0
	github.com/go-audio/wav v1.1.0
	github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12
	github.com/yalue/onnxruntime_go v1.11.0
	golang.org/x/text v0.14.0
)

require github.com/go-audio/riff v1.0.0 // indirect
12
go/go.sum
Normal file
@@ -0,0 +1,12 @@
github.com/go-audio/audio v1.0.0 h1:zS9vebldgbQqktK4H0lUqWrG8P0NxCJVqcj7ZpNnwd4=
github.com/go-audio/audio v1.0.0/go.mod h1:6uAu0+H2lHkwdGsAY+j2wHPNPpPoeg5AaEFh9FlA+Zs=
github.com/go-audio/riff v1.0.0 h1:d8iCGbDvox9BfLagY94fBynxSPHO80LmZCaOsmKxokA=
github.com/go-audio/riff v1.0.0/go.mod h1:l3cQwc85y79NQFCRB7TiPoNiaijp6q8Z0Uv38rVG498=
github.com/go-audio/wav v1.1.0 h1:jQgLtbqBzY7G+BM8fXF7AHUk1uHUviWS4X39d5rsL2g=
github.com/go-audio/wav v1.1.0/go.mod h1:mpe9qfwbScEbkd8uybLuIpTgHyrISw/OTuvjUW2iGtE=
github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12 h1:dd7vnTDfjtwCETZDrRe+GPYNLA1jBtbZeyfyE8eZCyk=
github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12/go.mod h1:i/KKcxEWEO8Yyl11DYafRPKOPVYTrhxiTRigjtEEXZU=
github.com/yalue/onnxruntime_go v1.11.0 h1:aKH4yPIbqfcB3SfnQWq/WxzLelkyolntHnffL3eMBHY=
github.com/yalue/onnxruntime_go v1.11.0/go.mod h1:b4X26A8pekNb1ACJ58wAXgNKeUCGEAQ9dmACut9Sm/4=
golang.org/x/text v0.14.0 h1:ScX5w1eTa3QqT8oi6+ziP7dTV1S2+ALU0bI+0zXKWiQ=
golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=
1066
go/helper.go
Normal file
BIN
img/supertonic_preview_0.1.jpg
Normal file
Size: 766 KiB
BIN
img/voicebuilder_img.png
Normal file
Size: 448 KiB
10
ios/ExampleiOSApp/App.swift
Normal file
@@ -0,0 +1,10 @@
import SwiftUI

@main
struct ExampleiOSApp: App {
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}
30
ios/ExampleiOSApp/AudioPlayer.swift
Normal file
@@ -0,0 +1,30 @@
import Foundation
import AVFoundation

final class AudioPlayer: NSObject, AVAudioPlayerDelegate {
    private var player: AVAudioPlayer?
    private var onFinish: (() -> Void)?

    func play(url: URL, onFinish: (() -> Void)? = nil) {
        self.onFinish = onFinish
        do {
            let data = try Data(contentsOf: url)
            let player = try AVAudioPlayer(data: data)
            player.delegate = self
            player.prepareToPlay()
            player.play()
            self.player = player
        } catch {
            print("Audio play error: \(error)")
        }
    }

    func stop() {
        player?.stop()
        player = nil
    }

    func audioPlayerDidFinishPlaying(_ player: AVAudioPlayer, successfully flag: Bool) {
        onFinish?()
    }
}
99
ios/ExampleiOSApp/ContentView.swift
Normal file
@@ -0,0 +1,99 @@
import SwiftUI

struct ContentView: View {
    @StateObject private var vm = TTSViewModel()

    var body: some View {
        ZStack {
            LinearGradient(gradient: Gradient(colors: [Color(.systemBackground), Color(.secondarySystemBackground)]), startPoint: .topLeading, endPoint: .bottomTrailing)
                .ignoresSafeArea()

            VStack(spacing: 20) {
                Spacer()

                VStack(spacing: 12) {
                    Text("Supertonic 2 iOS Demo")
                        .font(.title2.weight(.semibold))
                        .foregroundColor(.primary)

                    TextEditor(text: $vm.text)
                        .frame(minHeight: 120, maxHeight: 180)
                        .padding(8)
                        .background(Color(.secondarySystemBackground))
                        .cornerRadius(12)
                        .overlay(
                            RoundedRectangle(cornerRadius: 12)
                                .stroke(Color.secondary.opacity(0.3), lineWidth: 1)
                        )
                        .padding(.horizontal)

                    HStack(spacing: 12) {
                        Text("NFE")
                            .font(.subheadline)
                            .foregroundColor(.secondary)
                        Slider(value: $vm.nfe, in: 2...15, step: 1)
                        Text("\(Int(vm.nfe))")
                            .font(.subheadline.monospacedDigit())
                            .frame(width: 36)
                    }
                    .padding(.horizontal)

                    Picker("Voice", selection: $vm.voice) {
                        Text("M").tag(TTSService.Voice.male)
                        Text("F").tag(TTSService.Voice.female)
                    }
                    .pickerStyle(SegmentedPickerStyle())
                    .padding(.horizontal)

                    HStack(spacing: 12) {
                        Text("Language")
                            .font(.subheadline)
                            .foregroundColor(.secondary)
                        Picker("Language", selection: $vm.language) {
                            ForEach(TTSService.Language.allCases, id: \.self) { lang in
                                Text(lang.displayName).tag(lang)
                            }
                        }
                        .pickerStyle(MenuPickerStyle())
                    }
                    .padding(.horizontal)
                }

                HStack(spacing: 16) {
                    Button(action: { vm.generate() }) {
                        Label(vm.isGenerating ? "Generating..." : "Generate", systemImage: vm.isGenerating ? "hourglass" : "wand.and.stars")
                            .labelStyle(.titleAndIcon)
                    }
                    .buttonStyle(.borderedProminent)
                    .tint(.accentColor)
                    .disabled(vm.isGenerating)

                    Button(action: { vm.togglePlay() }) {
                        Label(vm.isPlaying ? "Stop" : "Play", systemImage: vm.isPlaying ? "stop.fill" : "play.fill")
                    }
                    .buttonStyle(.bordered)
                    .disabled(vm.audioURL == nil)
                }

                if let rtf = vm.rtfText {
                    Text(rtf)
                        .font(.footnote.monospacedDigit())
                        .foregroundColor(.secondary)
                        .padding(.top, 2)
                }

                if let error = vm.errorMessage {
                    Text(error)
                        .foregroundColor(.red)
                        .font(.footnote)
                        .multilineTextAlignment(.center)
                        .padding(.horizontal)
                }

                Spacer()
            }
        }
        .onAppear { vm.startup() }
    }
}
29
ios/ExampleiOSApp/Info.plist
Normal file
@@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>CFBundleDevelopmentRegion</key>
    <string>en</string>
    <key>CFBundleExecutable</key>
    <string>$(EXECUTABLE_NAME)</string>
    <key>CFBundleIdentifier</key>
    <string>$(PRODUCT_BUNDLE_IDENTIFIER)</string>
    <key>CFBundleInfoDictionaryVersion</key>
    <string>6.0</string>
    <key>CFBundleName</key>
    <string>ExampleiOSApp</string>
    <key>CFBundlePackageType</key>
    <string>APPL</string>
    <key>CFBundleShortVersionString</key>
    <string>1.0</string>
    <key>CFBundleVersion</key>
    <string>1</string>
    <key>UILaunchScreen</key>
    <dict/>
    <key>UIApplicationSceneManifest</key>
    <dict>
        <key>UIApplicationSupportsMultipleScenes</key>
        <false/>
    </dict>
</dict>
</plist>
114
ios/ExampleiOSApp/TTSService.swift
Normal file
@@ -0,0 +1,114 @@
import Foundation
import OnnxRuntimeBindings

final class TTSService {
    enum Voice { case male, female }
    enum Language: String, CaseIterable {
        case en = "en"
        case ko = "ko"
        case es = "es"
        case pt = "pt"
        case fr = "fr"

        var displayName: String {
            switch self {
            case .en: return "English"
            case .ko: return "한국어"
            case .es: return "Español"
            case .pt: return "Português"
            case .fr: return "Français"
            }
        }
    }

    private let env: ORTEnv
    private let textToSpeech: TextToSpeech
    private let bundleOnnxDir: String
    private let sampleRate: Int

    init() throws {
        bundleOnnxDir = try Self.locateOnnxDirInBundle()
        env = try ORTEnv(loggingLevel: .warning)
        textToSpeech = try loadTextToSpeech(bundleOnnxDir, false, env)
        sampleRate = textToSpeech.sampleRate
    }

    func synthesize(text: String, nfe: Int, voice: Voice, language: Language) async throws -> URL {
        // 1) Load the style for the selected voice
        let styleURL = try Self.locateVoiceStyleURL(voice: voice)
        let style = try loadVoiceStyle([styleURL.path], verbose: false)

        // 2) Synthesize via the packed TextToSpeech component
        let (wav, duration) = try textToSpeech.call(text, language.rawValue, style, nfe)
        let audioSeconds = Double(duration)
        let wavLenSample = min(Int(Double(sampleRate) * audioSeconds), wav.count)
        let wavOut = Array(wav[0..<wavLenSample])

        let tmpURL = FileManager.default.temporaryDirectory.appendingPathComponent("supertonic_tts_\(UUID().uuidString).wav")
        try writeWavFile(tmpURL.path, wavOut, sampleRate)

        return tmpURL
    }

    // MARK: - Resource location helpers
    private static func locateOnnxDirInBundle() throws -> String {
        let bundle = Bundle.main
        let fm = FileManager.default

        func dirHasRequiredFiles(_ dir: URL) -> Bool {
            let required = [
                "tts.json",
                "duration_predictor.onnx",
                "text_encoder.onnx",
                "vector_estimator.onnx",
                "vocoder.onnx"
            ]
            return required.allSatisfy { fm.fileExists(atPath: dir.appendingPathComponent($0).path) }
        }

        var candidates: [URL] = []
        if let dir = bundle.resourceURL?.appendingPathComponent("onnx", isDirectory: true) { candidates.append(dir) }
        if let dir = bundle.resourceURL?.appendingPathComponent("assets/onnx", isDirectory: true) { candidates.append(dir) }
        if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: "onnx") { candidates.append(url.deletingLastPathComponent()) }
        if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: "assets/onnx") { candidates.append(url.deletingLastPathComponent()) }
        if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: nil) { candidates.append(url.deletingLastPathComponent()) }
        if let root = bundle.resourceURL { candidates.append(root) }

        for dir in candidates {
            if dirHasRequiredFiles(dir) { return dir.path }
        }
        throw NSError(
            domain: "TTS",
            code: -100,
            userInfo: [NSLocalizedDescriptionKey: "Could not find the onnx directory in the bundle. Please make sure the onnx folder (as a folder reference) is included in Copy Bundle Resources in Xcode."]
        )
    }

    private static func locateVoiceStyleURL(voice: Voice) throws -> URL {
        // Prefer M1/F1 defaults; search common subdirectories
        let fileName = (voice == .male) ? "M1" : "F1"
        let bundle = Bundle.main
        let candidates: [URL?] = [
            bundle.url(forResource: fileName, withExtension: "json", subdirectory: "voice_styles"),
            bundle.url(forResource: fileName, withExtension: "json", subdirectory: "assets/voice_styles"),
            bundle.url(forResource: fileName, withExtension: "json", subdirectory: nil)
        ]
        for url in candidates {
            if let url = url { return url }
        }
        // Fallback: scan folders if needed
        if let folder1 = bundle.resourceURL?.appendingPathComponent("voice_styles", isDirectory: true) {
            let file = folder1.appendingPathComponent("\(fileName).json")
            if FileManager.default.fileExists(atPath: file.path) { return file }
        }
        if let folder2 = bundle.resourceURL?.appendingPathComponent("assets/voice_styles", isDirectory: true) {
            let file = folder2.appendingPathComponent("\(fileName).json")
            if FileManager.default.fileExists(atPath: file.path) { return file }
        }
        throw NSError(
            domain: "TTS",
            code: -102,
            userInfo: [NSLocalizedDescriptionKey: "Could not find the voice style JSON (\(fileName).json) in the bundle. Ensure the voice_styles folder is included in Copy Bundle Resources."]
        )
    }
}
82
ios/ExampleiOSApp/TTSViewModel.swift
Normal file
@@ -0,0 +1,82 @@
import Foundation
import AVFoundation

@MainActor
final class TTSViewModel: ObservableObject {
    @Published var text: String = "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
    @Published var nfe: Double = 5
    @Published var voice: TTSService.Voice = .male
    @Published var language: TTSService.Language = .en
    @Published var isGenerating: Bool = false
    @Published var isPlaying: Bool = false
    @Published var errorMessage: String?
    @Published var audioURL: URL?
    @Published var elapsedSeconds: Double?
    @Published var audioSeconds: Double?

    private var service: TTSService?
    private var player = AudioPlayer()

    var rtfText: String? {
        guard let e = elapsedSeconds, let a = audioSeconds, a > 0 else { return nil }
        return String(format: "RTF %.2fx · %.2fs / %.2fs", e / a, e, a)
    }

    func startup() {
        do {
            service = try TTSService()
        } catch {
            errorMessage = "Failed to init TTS: \(error.localizedDescription)"
        }
    }

    func generate() {
        guard let service = service else { return }
        isGenerating = true
        errorMessage = nil
        audioURL = nil
        elapsedSeconds = nil
        audioSeconds = nil
        Task {
            let tic = Date()
            do {
                let url = try await service.synthesize(text: text, nfe: Int(nfe), voice: voice, language: language)
                let elapsed = Date().timeIntervalSince(tic)
                let audio = audioDuration(at: url)
                await MainActor.run {
                    self.audioURL = url
                    self.elapsedSeconds = elapsed
                    self.audioSeconds = audio
                    self.isGenerating = false
                    self.play(url: url)
                }
            } catch {
                await MainActor.run {
                    self.errorMessage = error.localizedDescription
                    self.isGenerating = false
                }
            }
        }
    }

    func togglePlay() {
        if isPlaying {
            player.stop()
            isPlaying = false
        } else if let url = audioURL {
            play(url: url)
        }
    }

    private func play(url: URL) {
        player.play(url: url) { [weak self] in
            DispatchQueue.main.async { self?.isPlaying = false }
        }
        isPlaying = true
    }

    private func audioDuration(at url: URL) -> Double? {
        guard let file = try? AVAudioFile(forReading: url) else { return nil }
        return Double(file.length) / file.fileFormat.sampleRate
    }
}
29
ios/ExampleiOSApp/project.yml
Normal file
@@ -0,0 +1,29 @@
name: ExampleiOSApp
options:
  minimumXcodeGenVersion: 2.37.0
packages:
  onnxruntime:
    url: https://github.com/microsoft/onnxruntime-swift-package-manager.git
    from: 1.16.0
targets:
  ExampleiOSApp:
    type: application
    platform: iOS
    deploymentTarget: "15.0"
    sources:
      - path: .
      - path: ../../swift/Sources/Helper.swift
        type: file
    resources:
      - path: onnx
        type: folder
      - path: voice_styles # matches the folder vendored in README step 0 and searched by TTSService
        type: folder
    settings:
      base:
        PRODUCT_BUNDLE_IDENTIFIER: com.supertonic.ExampleiOSApp
        SWIFT_VERSION: 5.9
        INFOPLIST_FILE: Info.plist
    dependencies:
      - package: onnxruntime
        product: onnxruntime
78
ios/README.md
Normal file
@@ -0,0 +1,78 @@
# Supertonic iOS Example App

A minimal iOS demo that runs Supertonic 2 (ONNX Runtime) on-device. The app shows:
- Multiline text input
- NFE (denoising steps) slider
- Voice toggle (M/F)
- Language selector (en, ko, es, pt, fr)
- Generate & Play buttons
- RTF display (elapsed / audio seconds)

All ONNX models/configs are reused from `Supertonic/assets/onnx`, and voice style JSON files from `Supertonic/assets/voice_styles`.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)

## Prerequisites
- macOS 13+, Xcode 15+
- Swift 5.9+
- iOS 15+ device (recommended)
- Homebrew, XcodeGen

Install tools (if needed):
```bash
brew install xcodegen
```

## Quick Start (zero-click in Xcode)
0) Prepare assets next to the iOS target (one-time)
```bash
cd ios/ExampleiOSApp
mkdir -p onnx voice_styles
rsync -a ../../assets/onnx/ onnx/
rsync -a ../../assets/voice_styles/ voice_styles/
```

1) Generate the Xcode project with XcodeGen
```bash
xcodegen generate
open ExampleiOSApp.xcodeproj
```

2) Open the project in Xcode and select your iPhone as the run destination
- Targets → ExampleiOSApp → Signing & Capabilities: select your Team
- iOS Deployment Target: 15.0+

3) Build & run on device
- Type text → adjust NFE/Voice → tap Generate → audio plays automatically
- An RTF line is shown, e.g. `RTF 0.30x · 3.04s / 10.11s`

## What's included (generated project)
- SwiftUI app files: `App.swift`, `ContentView.swift`, `TTSViewModel.swift`, `AudioPlayer.swift`
- Runtime wrapper: `TTSService.swift` (includes the TTS inference logic)
- Resources (vendored locally in `ios/ExampleiOSApp/onnx` and `ios/ExampleiOSApp/voice_styles` after step 0)

These references are defined in `project.yml` and added to the app bundle by XcodeGen.

## App Controls
- **Text**: Multiline `TextEditor`
- **NFE**: Denoising steps (default 5)
- **Voice**: M/F voice style selector
- **Language**: Language selector (English, 한국어, Español, Português, Français)
- **Generate**: Runs end-to-end synthesis
- **Play/Stop**: Controls playback of the last output
- **RTF**: Shows elapsed / audio seconds for quick performance intuition
## Multilingual Support

Supertonic 2 supports multiple languages. Select the appropriate language for your input text:
- **English (en)**: Default language
- **한국어 (ko)**: Korean
- **Español (es)**: Spanish
- **Português (pt)**: Portuguese
- **Français (fr)**: French
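Under the hood, the selected language is validated against the supported set and the input text is wrapped in language tags (e.g. `<fr>...</fr>`) before tokenization, as done in the text preprocessor of the Java port below. A minimal Go sketch of that step (the `wrapLang` helper is illustrative, not an API of this repo):

```go
package main

import "fmt"

// Supported languages, mirroring the list used by the ports.
var available = []string{"en", "ko", "es", "pt", "fr"}

// wrapLang wraps text in language tags after validating the language code.
func wrapLang(text, lang string) (string, error) {
	for _, l := range available {
		if l == lang {
			return fmt.Sprintf("<%s>%s</%s>", lang, text, lang), nil
		}
	}
	return "", fmt.Errorf("invalid language: %s (available: %v)", lang, available)
}

func main() {
	s, err := wrapLang("Bonjour tout le monde.", "fr")
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // <fr>Bonjour tout le monde.</fr>
}
```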
35
java/.gitignore
vendored
Normal file
@@ -0,0 +1,35 @@
# Maven
target/
pom.xml.tag
pom.xml.releaseBackup
pom.xml.versionsBackup
pom.xml.next
release.properties
dependency-reduced-pom.xml
buildNumber.properties
.mvn/timing.properties
.mvn/wrapper/maven-wrapper.jar

# Compiled class files
*.class

# IntelliJ IDEA
.idea/
*.iml
*.iws
*.ipr

# Eclipse
.classpath
.project
.settings/

# VS Code
.vscode/

# Results
results/*.wav

# Mac
.DS_Store
183
java/ExampleONNX.java
Normal file
@@ -0,0 +1,183 @@
import ai.onnxruntime.*;

import java.io.File;
import java.util.*;

/**
 * TTS Inference Example with ONNX Runtime (Java)
 */
public class ExampleONNX {

    /**
     * Command line arguments
     */
    static class Args {
        boolean useGpu = false;
        String onnxDir = "assets/onnx";
        int totalStep = 5;
        float speed = 1.05f;
        int nTest = 4;
        List<String> voiceStyle = Arrays.asList("assets/voice_styles/M1.json");
        List<String> text = Arrays.asList(
            "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
        );
        List<String> lang = Arrays.asList("en");
        String saveDir = "results";
        boolean batch = false;
    }

    /**
     * Parse command line arguments
     */
    private static Args parseArgs(String[] args) {
        Args result = new Args();

        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "--use-gpu":
                    result.useGpu = true;
                    break;
                case "--onnx-dir":
                    if (i + 1 < args.length) result.onnxDir = args[++i];
                    break;
                case "--total-step":
                    if (i + 1 < args.length) result.totalStep = Integer.parseInt(args[++i]);
                    break;
                case "--speed":
                    if (i + 1 < args.length) result.speed = Float.parseFloat(args[++i]);
                    break;
                case "--n-test":
                    if (i + 1 < args.length) result.nTest = Integer.parseInt(args[++i]);
                    break;
                case "--voice-style":
                    if (i + 1 < args.length) {
                        result.voiceStyle = Arrays.asList(args[++i].split(","));
                    }
                    break;
                case "--text":
                    if (i + 1 < args.length) {
                        result.text = Arrays.asList(args[++i].split("\\|"));
                    }
                    break;
                case "--lang":
                    if (i + 1 < args.length) {
                        result.lang = Arrays.asList(args[++i].split(","));
                    }
                    break;
                case "--save-dir":
                    if (i + 1 < args.length) result.saveDir = args[++i];
                    break;
                case "--batch":
                    result.batch = true;
                    break;
            }
        }

        return result;
    }

    /**
     * Main inference function
     */
    public static void main(String[] args) {
        try {
            System.out.println("=== TTS Inference with ONNX Runtime (Java) ===\n");

            // --- 1. Parse arguments --- //
            Args parsedArgs = parseArgs(args);
            int totalStep = parsedArgs.totalStep;
            float speed = parsedArgs.speed;
            int nTest = parsedArgs.nTest;
            String saveDir = parsedArgs.saveDir;
            List<String> voiceStylePaths = parsedArgs.voiceStyle;
            List<String> textList = parsedArgs.text;
            List<String> langList = parsedArgs.lang;
            boolean batch = parsedArgs.batch;

            if (batch) {
                if (voiceStylePaths.size() != textList.size()) {
                    throw new RuntimeException("Number of voice styles (" + voiceStylePaths.size() +
                        ") must match number of texts (" + textList.size() + ")");
                }
                if (langList.size() != textList.size()) {
                    throw new RuntimeException("Number of languages (" + langList.size() +
                        ") must match number of texts (" + textList.size() + ")");
                }
            }

            int bsz = voiceStylePaths.size();
            OrtEnvironment env = OrtEnvironment.getEnvironment();

            // --- 2. Load TTS components --- //
            TextToSpeech textToSpeech = Helper.loadTextToSpeech(parsedArgs.onnxDir, parsedArgs.useGpu, env);

            // --- 3. Load voice styles --- //
            Style style = Helper.loadVoiceStyle(voiceStylePaths, true, env);

            // --- 4. Synthesize speech --- //
            File saveDirFile = new File(saveDir);
            if (!saveDirFile.exists()) {
                saveDirFile.mkdirs();
            }

            for (int n = 0; n < nTest; n++) {
                System.out.println("\n[" + (n + 1) + "/" + nTest + "] Starting synthesis...");

                TTSResult ttsResult;
                if (batch) {
                    ttsResult = Helper.timer("Generating speech from text", () -> {
                        try {
                            return textToSpeech.batch(textList, langList, style, totalStep, speed, env);
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    });
                } else {
                    ttsResult = Helper.timer("Generating speech from text", () -> {
                        try {
                            return textToSpeech.call(textList.get(0), langList.get(0), style, totalStep, speed, 0.3f, env);
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    });
                }

                float[] wav = ttsResult.wav;
                float[] duration = ttsResult.duration;

                // Save outputs
                for (int i = 0; i < bsz; i++) {
                    String fname = Helper.sanitizeFilename(textList.get(i), 20) + "_" + (n + 1) + ".wav";
                    float[] wavOut;

                    if (batch) {
                        int wavLen = wav.length / bsz;
                        int actualLen = (int) (textToSpeech.sampleRate * duration[i]);
                        wavOut = new float[actualLen];
                        System.arraycopy(wav, i * wavLen, wavOut, 0, Math.min(actualLen, wavLen));
                    } else {
                        // For non-batch mode, wav is a single concatenated audio
                        int actualLen = (int) (textToSpeech.sampleRate * duration[0]);
                        wavOut = new float[Math.min(actualLen, wav.length)];
                        System.arraycopy(wav, 0, wavOut, 0, wavOut.length);
                    }

                    String outputPath = saveDir + "/" + fname;
                    Helper.writeWavFile(outputPath, wavOut, textToSpeech.sampleRate);
                    System.out.println("Saved: " + outputPath);
                }
            }

            // Clean up
            style.close();
            textToSpeech.close();

            System.out.println("\n=== Synthesis completed successfully! ===");

        } catch (Exception e) {
            System.err.println("Error during inference: " + e.getMessage());
            e.printStackTrace();
            System.exit(1);
        }
    }
}
955
java/Helper.java
Normal file
@@ -0,0 +1,955 @@
|
||||
import ai.onnxruntime.*;
|
||||
import com.fasterxml.jackson.databind.JsonNode;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
|
||||
import javax.sound.sampled.AudioFileFormat;
|
||||
import javax.sound.sampled.AudioFormat;
|
||||
import javax.sound.sampled.AudioInputStream;
|
||||
import javax.sound.sampled.AudioSystem;
|
||||
import java.io.*;
|
||||
import java.nio.ByteBuffer;
|
||||
import java.nio.ByteOrder;
|
||||
import java.nio.FloatBuffer;
|
||||
import java.nio.LongBuffer;
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Paths;
|
||||
import java.text.Normalizer;
|
||||
import java.util.*;
|
||||
import java.util.regex.Pattern;
|
||||
import java.util.regex.Matcher;
|
||||
|
||||
/**
|
||||
* Available languages for multilingual TTS
|
||||
*/
|
||||
class Languages {
|
||||
public static final List<String> AVAILABLE = Arrays.asList("en", "ko", "es", "pt", "fr");
|
||||
|
||||
public static boolean isValid(String lang) {
|
||||
return AVAILABLE.contains(lang);
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Configuration classes
|
||||
*/
|
||||
class Config {
|
||||
static class AEConfig {
|
||||
int sampleRate;
|
||||
int baseChunkSize;
|
||||
}
|
||||
|
||||
static class TTLConfig {
|
||||
int chunkCompressFactor;
|
||||
int latentDim;
|
||||
}
|
||||
|
||||
AEConfig ae;
|
||||
TTLConfig ttl;
|
||||
}
|
||||
|
||||
/**
|
||||
* Voice Style Data from JSON
|
||||
*/
|
||||
class VoiceStyleData {
|
||||
static class StyleData {
|
||||
float[][][] data;
|
||||
long[] dims;
|
||||
String type;
|
||||
}
|
||||
|
||||
StyleData styleTtl;
|
||||
StyleData styleDp;
|
||||
}
|
||||
|
||||
/**
|
||||
* Unicode text processor
|
||||
*/
|
||||
class UnicodeProcessor {
|
||||
private long[] indexer;
|
||||
|
||||
public UnicodeProcessor(String unicodeIndexerJsonPath) throws IOException {
|
||||
this.indexer = Helper.loadJsonLongArray(unicodeIndexerJsonPath);
|
||||
}
|
||||
|
||||
private static String removeEmojis(String text) {
|
||||
StringBuilder result = new StringBuilder();
|
||||
for (int i = 0; i < text.length(); i++) {
|
||||
int codePoint;
|
||||
if (Character.isHighSurrogate(text.charAt(i)) && i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
|
||||
codePoint = Character.codePointAt(text, i);
|
||||
i++; // Skip the low surrogate
|
||||
} else {
|
||||
codePoint = text.charAt(i);
|
||||
}
|
||||
|
||||
// Check if code point is in emoji ranges
|
||||
boolean isEmoji = (codePoint >= 0x1F600 && codePoint <= 0x1F64F) ||
|
||||
(codePoint >= 0x1F300 && codePoint <= 0x1F5FF) ||
|
||||
(codePoint >= 0x1F680 && codePoint <= 0x1F6FF) ||
|
||||
(codePoint >= 0x1F700 && codePoint <= 0x1F77F) ||
|
||||
(codePoint >= 0x1F780 && codePoint <= 0x1F7FF) ||
|
||||
(codePoint >= 0x1F800 && codePoint <= 0x1F8FF) ||
|
||||
(codePoint >= 0x1F900 && codePoint <= 0x1F9FF) ||
|
||||
(codePoint >= 0x1FA00 && codePoint <= 0x1FA6F) ||
|
||||
(codePoint >= 0x1FA70 && codePoint <= 0x1FAFF) ||
|
||||
(codePoint >= 0x2600 && codePoint <= 0x26FF) ||
|
||||
(codePoint >= 0x2700 && codePoint <= 0x27BF) ||
|
||||
(codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF);
|
||||
|
||||
if (!isEmoji) {
|
||||
if (codePoint > 0xFFFF) {
|
||||
result.append(Character.toChars(codePoint));
|
||||
} else {
|
||||
result.append((char) codePoint);
|
||||
}
|
||||
}
|
||||
}
|
||||
return result.toString();
|
||||
}
|
||||
|
||||
public TextProcessResult call(List<String> textList, List<String> langList) {
|
||||
List<String> processedTexts = new ArrayList<>();
|
||||
for (int i = 0; i < textList.size(); i++) {
|
||||
processedTexts.add(preprocessText(textList.get(i), langList.get(i)));
|
||||
}
|
||||
|
||||
// Convert texts to unicode values first to get correct character counts
|
||||
List<int[]> allUnicodeVals = new ArrayList<>();
|
||||
for (String text : processedTexts) {
|
||||
allUnicodeVals.add(textToUnicodeValues(text));
|
||||
}
|
||||
|
||||
int[] textIdsLengths = new int[processedTexts.size()];
|
||||
int maxLen = 0;
|
||||
for (int i = 0; i < allUnicodeVals.size(); i++) {
|
||||
textIdsLengths[i] = allUnicodeVals.get(i).length; // Use code point count, not char count
|
||||
maxLen = Math.max(maxLen, textIdsLengths[i]);
|
||||
}
|
||||
|
||||
long[][] textIds = new long[processedTexts.size()][maxLen];
|
||||
for (int i = 0; i < allUnicodeVals.size(); i++) {
|
||||
int[] unicodeVals = allUnicodeVals.get(i);
|
||||
for (int j = 0; j < unicodeVals.length; j++) {
|
||||
textIds[i][j] = indexer[unicodeVals[j]];
|
||||
}
|
||||
}
|
||||
|
||||
float[][][] textMask = getTextMask(textIdsLengths);
|
||||
return new TextProcessResult(textIds, textMask);
|
||||
}
|

    private String preprocessText(String text, String lang) {
        // TODO: Need advanced normalizer for better performance
        text = Normalizer.normalize(text, Normalizer.Form.NFKD);

        // Remove emojis (wide Unicode range).
        // Java Pattern doesn't support \x{...} syntax for Unicode above \uFFFF,
        // so use character filtering instead.
        text = removeEmojis(text);

        // Replace various dashes and symbols
        Map<String, String> replacements = new HashMap<>();
        replacements.put("–", "-");       // en dash
        replacements.put("‑", "-");       // non-breaking hyphen
        replacements.put("—", "-");       // em dash
        replacements.put("_", " ");       // underscore
        replacements.put("\u201C", "\""); // left double quote
        replacements.put("\u201D", "\""); // right double quote
        replacements.put("\u2018", "'");  // left single quote
        replacements.put("\u2019", "'");  // right single quote
        replacements.put("´", "'");       // acute accent
        replacements.put("`", "'");       // grave accent
        replacements.put("[", " ");       // left bracket
        replacements.put("]", " ");       // right bracket
        replacements.put("|", " ");       // vertical bar
        replacements.put("/", " ");       // slash
        replacements.put("#", " ");       // hash
        replacements.put("→", " ");       // right arrow
        replacements.put("←", " ");       // left arrow

        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            text = text.replace(entry.getKey(), entry.getValue());
        }

        // Remove special symbols
        text = text.replaceAll("[♥☆♡©\\\\]", "");

        // Replace known expressions
        Map<String, String> exprReplacements = new HashMap<>();
        exprReplacements.put("@", " at ");
        exprReplacements.put("e.g.,", "for example, ");
        exprReplacements.put("i.e.,", "that is, ");

        for (Map.Entry<String, String> entry : exprReplacements.entrySet()) {
            text = text.replace(entry.getKey(), entry.getValue());
        }

        // Fix spacing around punctuation
        text = text.replaceAll(" ,", ",");
        text = text.replaceAll(" \\.", ".");
        text = text.replaceAll(" !", "!");
        text = text.replaceAll(" \\?", "?");
        text = text.replaceAll(" ;", ";");
        text = text.replaceAll(" :", ":");
        text = text.replaceAll(" '", "'");

        // Collapse duplicate quotes
        while (text.contains("\"\"")) {
            text = text.replace("\"\"", "\"");
        }
        while (text.contains("''")) {
            text = text.replace("''", "'");
        }
        while (text.contains("``")) {
            text = text.replace("``", "`");
        }

        // Collapse extra whitespace
        text = text.replaceAll("\\s+", " ").trim();

        // If the text doesn't end with punctuation, quotes, or closing brackets, add a period
        if (!text.matches(".*[.!?;:,'\"\\u201C\\u201D\\u2018\\u2019)\\]}…。」』】〉》›»]$")) {
            text += ".";
        }

        // Validate language
        if (!Languages.isValid(lang)) {
            throw new IllegalArgumentException("Invalid language: " + lang + ". Available: " + Languages.AVAILABLE);
        }

        // Wrap text with language tags
        text = "<" + lang + ">" + text + "</" + lang + ">";

        return text;
    }

    private int[] textToUnicodeValues(String text) {
        // Use the codePoints() stream to correctly handle surrogate pairs
        return text.codePoints().toArray();
    }

    private float[][][] getTextMask(int[] lengths) {
        int bsz = lengths.length;
        int maxLen = 0;
        for (int len : lengths) {
            maxLen = Math.max(maxLen, len);
        }

        float[][][] mask = new float[bsz][1][maxLen];
        for (int i = 0; i < bsz; i++) {
            for (int j = 0; j < maxLen; j++) {
                mask[i][0][j] = j < lengths[i] ? 1.0f : 0.0f;
            }
        }
        return mask;
    }

    static class TextProcessResult {
        long[][] textIds;
        float[][][] textMask;

        TextProcessResult(long[][] textIds, float[][][] textMask) {
            this.textIds = textIds;
            this.textMask = textMask;
        }
    }
}

/**
 * Text-to-Speech inference class
 */
class TextToSpeech {
    private Config config;
    private UnicodeProcessor textProcessor;
    private OrtSession dpSession;
    private OrtSession textEncSession;
    private OrtSession vectorEstSession;
    private OrtSession vocoderSession;
    public int sampleRate;
    private int baseChunkSize;
    private int chunkCompress;
    private int ldim;

    public TextToSpeech(Config config, UnicodeProcessor textProcessor,
                        OrtSession dpSession, OrtSession textEncSession,
                        OrtSession vectorEstSession, OrtSession vocoderSession) {
        this.config = config;
        this.textProcessor = textProcessor;
        this.dpSession = dpSession;
        this.textEncSession = textEncSession;
        this.vectorEstSession = vectorEstSession;
        this.vocoderSession = vocoderSession;
        this.sampleRate = config.ae.sampleRate;
        this.baseChunkSize = config.ae.baseChunkSize;
        this.chunkCompress = config.ttl.chunkCompressFactor;
        this.ldim = config.ttl.latentDim;
    }

    private TTSResult _infer(List<String> textList, List<String> langList, Style style, int totalStep, float speed, OrtEnvironment env)
            throws OrtException {
        int bsz = textList.size();

        // Process text
        UnicodeProcessor.TextProcessResult textResult = textProcessor.call(textList, langList);
        long[][] textIds = textResult.textIds;
        float[][][] textMask = textResult.textMask;

        // Create tensors
        OnnxTensor textIdsTensor = Helper.createLongTensor(textIds, env);
        OnnxTensor textMaskTensor = Helper.createFloatTensor(textMask, env);

        // Predict duration
        Map<String, OnnxTensor> dpInputs = new HashMap<>();
        dpInputs.put("text_ids", textIdsTensor);
        dpInputs.put("style_dp", style.dpTensor);
        dpInputs.put("text_mask", textMaskTensor);

        OrtSession.Result dpResult = dpSession.run(dpInputs);
        Object dpValue = dpResult.get(0).getValue();
        float[] duration;
        if (dpValue instanceof float[][]) {
            duration = ((float[][]) dpValue)[0];
        } else {
            duration = (float[]) dpValue;
        }

        // Apply speed factor to duration
        for (int i = 0; i < duration.length; i++) {
            duration[i] /= speed;
        }

        // Encode text
        Map<String, OnnxTensor> textEncInputs = new HashMap<>();
        textEncInputs.put("text_ids", textIdsTensor);
        textEncInputs.put("style_ttl", style.ttlTensor);
        textEncInputs.put("text_mask", textMaskTensor);

        OrtSession.Result textEncResult = textEncSession.run(textEncInputs);
        OnnxTensor textEmbTensor = (OnnxTensor) textEncResult.get(0);

        // Sample noisy latent
        NoisyLatentResult noisyLatentResult = sampleNoisyLatent(duration);
        float[][][] xt = noisyLatentResult.noisyLatent;
        float[][][] latentMask = noisyLatentResult.latentMask;

        // Prepare constant tensors
        float[] totalStepArray = new float[bsz];
        Arrays.fill(totalStepArray, (float) totalStep);
        OnnxTensor totalStepTensor = OnnxTensor.createTensor(env, totalStepArray);

        // Denoising loop
        for (int step = 0; step < totalStep; step++) {
            float[] currentStepArray = new float[bsz];
            Arrays.fill(currentStepArray, (float) step);
            OnnxTensor currentStepTensor = OnnxTensor.createTensor(env, currentStepArray);
            OnnxTensor noisyLatentTensor = Helper.createFloatTensor(xt, env);
            OnnxTensor latentMaskTensor = Helper.createFloatTensor(latentMask, env);
            OnnxTensor textMaskTensor2 = Helper.createFloatTensor(textMask, env);

            Map<String, OnnxTensor> vectorEstInputs = new HashMap<>();
            vectorEstInputs.put("noisy_latent", noisyLatentTensor);
            vectorEstInputs.put("text_emb", textEmbTensor);
            vectorEstInputs.put("style_ttl", style.ttlTensor);
            vectorEstInputs.put("latent_mask", latentMaskTensor);
            vectorEstInputs.put("text_mask", textMaskTensor2);
            vectorEstInputs.put("current_step", currentStepTensor);
            vectorEstInputs.put("total_step", totalStepTensor);

            OrtSession.Result vectorEstResult = vectorEstSession.run(vectorEstInputs);
            float[][][] denoised = (float[][][]) vectorEstResult.get(0).getValue();

            // Update latent
            xt = denoised;

            // Clean up
            currentStepTensor.close();
            noisyLatentTensor.close();
            latentMaskTensor.close();
            textMaskTensor2.close();
            vectorEstResult.close();
        }

        // Generate waveform
        OnnxTensor finalLatentTensor = Helper.createFloatTensor(xt, env);
        Map<String, OnnxTensor> vocoderInputs = new HashMap<>();
        vocoderInputs.put("latent", finalLatentTensor);

        OrtSession.Result vocoderResult = vocoderSession.run(vocoderInputs);
        float[][] wavBatch = (float[][]) vocoderResult.get(0).getValue();

        // Flatten all batch audio into a single array for batch processing
        int totalSamples = 0;
        for (float[] w : wavBatch) {
            totalSamples += w.length;
        }
        float[] wav = new float[totalSamples];
        int offset = 0;
        for (float[] w : wavBatch) {
            System.arraycopy(w, 0, wav, offset, w.length);
            offset += w.length;
        }

        // Clean up
        textIdsTensor.close();
        textMaskTensor.close();
        dpResult.close();
        textEncResult.close();
        totalStepTensor.close();
        finalLatentTensor.close();
        vocoderResult.close();

        return new TTSResult(wav, duration);
    }

    private NoisyLatentResult sampleNoisyLatent(float[] duration) {
        int bsz = duration.length;
        float maxDur = 0;
        for (float d : duration) {
            maxDur = Math.max(maxDur, d);
        }

        long wavLenMax = (long) (maxDur * sampleRate);
        long[] wavLengths = new long[bsz];
        for (int i = 0; i < bsz; i++) {
            wavLengths[i] = (long) (duration[i] * sampleRate);
        }

        int chunkSize = baseChunkSize * chunkCompress;
        int latentLen = (int) ((wavLenMax + chunkSize - 1) / chunkSize);
        int latentDim = ldim * chunkCompress;

        Random rng = new Random();
        float[][][] noisyLatent = new float[bsz][latentDim][latentLen];
        for (int b = 0; b < bsz; b++) {
            for (int d = 0; d < latentDim; d++) {
                for (int t = 0; t < latentLen; t++) {
                    // Box-Muller transform: a standard normal sample from two uniforms
                    double u1 = Math.max(1e-10, rng.nextDouble());
                    double u2 = rng.nextDouble();
                    noisyLatent[b][d][t] = (float) (Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2));
                }
            }
        }

        float[][][] latentMask = Helper.getLatentMask(wavLengths, config);

        // Apply mask
        for (int b = 0; b < bsz; b++) {
            for (int d = 0; d < latentDim; d++) {
                for (int t = 0; t < latentLen; t++) {
                    noisyLatent[b][d][t] *= latentMask[b][0][t];
                }
            }
        }

        return new NoisyLatentResult(noisyLatent, latentMask);
    }

    /**
     * Synthesize speech from a single text with automatic chunking
     */
    public TTSResult call(String text, String lang, Style style, int totalStep, float speed, float silenceDuration, OrtEnvironment env)
            throws OrtException {
        int maxLen = lang.equals("ko") ? 120 : 300;
        List<String> chunks = Helper.chunkText(text, maxLen);

        List<Float> wavCat = new ArrayList<>();
        float durCat = 0.0f;

        for (int i = 0; i < chunks.size(); i++) {
            TTSResult result = _infer(Arrays.asList(chunks.get(i)), Arrays.asList(lang), style, totalStep, speed, env);

            float dur = result.duration[0];
            int wavLen = (int) (sampleRate * dur);
            float[] wavChunk = new float[wavLen];
            System.arraycopy(result.wav, 0, wavChunk, 0, Math.min(wavLen, result.wav.length));

            if (i == 0) {
                for (float val : wavChunk) {
                    wavCat.add(val);
                }
                durCat = dur;
            } else {
                int silenceLen = (int) (silenceDuration * sampleRate);
                for (int j = 0; j < silenceLen; j++) {
                    wavCat.add(0.0f);
                }
                for (float val : wavChunk) {
                    wavCat.add(val);
                }
                durCat += silenceDuration + dur;
            }
        }

        float[] wavArray = new float[wavCat.size()];
        for (int i = 0; i < wavCat.size(); i++) {
            wavArray[i] = wavCat.get(i);
        }

        return new TTSResult(wavArray, new float[]{durCat});
    }

    /**
     * Batch synthesize speech from multiple texts
     */
    public TTSResult batch(List<String> textList, List<String> langList, Style style, int totalStep, float speed, OrtEnvironment env)
            throws OrtException {
        return _infer(textList, langList, style, totalStep, speed, env);
    }

    public void close() throws OrtException {
        if (dpSession != null) dpSession.close();
        if (textEncSession != null) textEncSession.close();
        if (vectorEstSession != null) vectorEstSession.close();
        if (vocoderSession != null) vocoderSession.close();
    }
}

/**
 * Style holder class
 */
class Style {
    OnnxTensor ttlTensor;
    OnnxTensor dpTensor;

    Style(OnnxTensor ttlTensor, OnnxTensor dpTensor) {
        this.ttlTensor = ttlTensor;
        this.dpTensor = dpTensor;
    }

    public void close() throws OrtException {
        if (ttlTensor != null) ttlTensor.close();
        if (dpTensor != null) dpTensor.close();
    }
}

/**
 * TTS result holder
 */
class TTSResult {
    float[] wav;
    float[] duration;

    TTSResult(float[] wav, float[] duration) {
        this.wav = wav;
        this.duration = duration;
    }
}

/**
 * Noisy latent result holder
 */
class NoisyLatentResult {
    float[][][] noisyLatent;
    float[][][] latentMask;

    NoisyLatentResult(float[][][] noisyLatent, float[][][] latentMask) {
        this.noisyLatent = noisyLatent;
        this.latentMask = latentMask;
    }
}

/**
 * Helper utility class
 */
public class Helper {

    private static final int MAX_CHUNK_LENGTH = 300;
    private static final String[] ABBREVIATIONS = {
        "Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
        "St.", "Ave.", "Rd.", "Blvd.", "Dept.", "Inc.", "Ltd.",
        "Co.", "Corp.", "etc.", "vs.", "i.e.", "e.g.", "Ph.D."
    };

    /**
     * Chunk text into smaller segments based on paragraphs and sentences
     */
    public static List<String> chunkText(String text, int maxLen) {
        if (maxLen == 0) {
            maxLen = MAX_CHUNK_LENGTH;
        }

        text = text.trim();
        if (text.isEmpty()) {
            return Arrays.asList("");
        }

        // Split by paragraphs
        String[] paragraphs = text.split("\\n\\s*\\n");
        List<String> chunks = new ArrayList<>();

        for (String para : paragraphs) {
            para = para.trim();
            if (para.isEmpty()) {
                continue;
            }

            if (para.length() <= maxLen) {
                chunks.add(para);
                continue;
            }

            // Split by sentences
            List<String> sentences = splitSentences(para);
            StringBuilder current = new StringBuilder();
            int currentLen = 0;

            for (String sentence : sentences) {
                sentence = sentence.trim();
                if (sentence.isEmpty()) {
                    continue;
                }

                int sentenceLen = sentence.length();
                if (sentenceLen > maxLen) {
                    // If the sentence is longer than maxLen, split by comma or space
                    if (current.length() > 0) {
                        chunks.add(current.toString().trim());
                        current.setLength(0);
                        currentLen = 0;
                    }

                    // Try splitting by comma
                    String[] parts = sentence.split(",");
                    for (String part : parts) {
                        part = part.trim();
                        if (part.isEmpty()) {
                            continue;
                        }

                        int partLen = part.length();
                        if (partLen > maxLen) {
                            // Split by space as a last resort
                            String[] words = part.split("\\s+");
                            StringBuilder wordChunk = new StringBuilder();
                            int wordChunkLen = 0;

                            for (String word : words) {
                                int wordLen = word.length();
                                if (wordChunkLen + wordLen + 1 > maxLen && wordChunk.length() > 0) {
                                    chunks.add(wordChunk.toString().trim());
                                    wordChunk.setLength(0);
                                    wordChunkLen = 0;
                                }

                                if (wordChunk.length() > 0) {
                                    wordChunk.append(" ");
                                    wordChunkLen++;
                                }
                                wordChunk.append(word);
                                wordChunkLen += wordLen;
                            }

                            if (wordChunk.length() > 0) {
                                chunks.add(wordChunk.toString().trim());
                            }
                        } else {
                            if (currentLen + partLen + 1 > maxLen && current.length() > 0) {
                                chunks.add(current.toString().trim());
                                current.setLength(0);
                                currentLen = 0;
                            }

                            if (current.length() > 0) {
                                current.append(", ");
                                currentLen += 2;
                            }
                            current.append(part);
                            currentLen += partLen;
                        }
                    }
                    continue;
                }

                if (currentLen + sentenceLen + 1 > maxLen && current.length() > 0) {
                    chunks.add(current.toString().trim());
                    current.setLength(0);
                    currentLen = 0;
                }

                if (current.length() > 0) {
                    current.append(" ");
                    currentLen++;
                }
                current.append(sentence);
                currentLen += sentenceLen;
            }

            if (current.length() > 0) {
                chunks.add(current.toString().trim());
            }
        }

        if (chunks.isEmpty()) {
            return Arrays.asList("");
        }

        return chunks;
    }

    /**
     * Split text into sentences, avoiding common abbreviations
     */
    private static List<String> splitSentences(String text) {
        // Build an alternation of quoted abbreviations
        StringBuilder abbrevPattern = new StringBuilder();
        for (int i = 0; i < ABBREVIATIONS.length; i++) {
            if (i > 0) abbrevPattern.append("|");
            abbrevPattern.append(Pattern.quote(ABBREVIATIONS[i]));
        }

        // Split on whitespace that follows a sentence-ending mark, but not an abbreviation
        String patternStr = "(?<!(?:" + abbrevPattern.toString() + "))(?<=[.!?])\\s+";
        Pattern pattern = Pattern.compile(patternStr);
        return Arrays.asList(pattern.split(text));
    }

    /**
     * Load voice style from JSON files
     */
    public static Style loadVoiceStyle(List<String> voiceStylePaths, boolean verbose, OrtEnvironment env)
            throws IOException, OrtException {
        int bsz = voiceStylePaths.size();

        // Read the first file to get dimensions
        ObjectMapper mapper = new ObjectMapper();
        JsonNode firstRoot = mapper.readTree(new File(voiceStylePaths.get(0)));

        long[] ttlDims = new long[3];
        for (int i = 0; i < 3; i++) {
            ttlDims[i] = firstRoot.get("style_ttl").get("dims").get(i).asLong();
        }
        long[] dpDims = new long[3];
        for (int i = 0; i < 3; i++) {
            dpDims[i] = firstRoot.get("style_dp").get("dims").get(i).asLong();
        }

        long ttlDim1 = ttlDims[1];
        long ttlDim2 = ttlDims[2];
        long dpDim1 = dpDims[1];
        long dpDim2 = dpDims[2];

        // Pre-allocate arrays with the full batch size
        int ttlSize = (int) (bsz * ttlDim1 * ttlDim2);
        int dpSize = (int) (bsz * dpDim1 * dpDim2);
        float[] ttlFlat = new float[ttlSize];
        float[] dpFlat = new float[dpSize];

        // Fill in the data
        for (int i = 0; i < bsz; i++) {
            JsonNode root = mapper.readTree(new File(voiceStylePaths.get(i)));

            // Flatten TTL data
            int ttlOffset = (int) (i * ttlDim1 * ttlDim2);
            int idx = 0;
            JsonNode ttlData = root.get("style_ttl").get("data");
            for (JsonNode batch : ttlData) {
                for (JsonNode row : batch) {
                    for (JsonNode val : row) {
                        ttlFlat[ttlOffset + idx++] = (float) val.asDouble();
                    }
                }
            }

            // Flatten DP data
            int dpOffset = (int) (i * dpDim1 * dpDim2);
            idx = 0;
            JsonNode dpData = root.get("style_dp").get("data");
            for (JsonNode batch : dpData) {
                for (JsonNode row : batch) {
                    for (JsonNode val : row) {
                        dpFlat[dpOffset + idx++] = (float) val.asDouble();
                    }
                }
            }
        }

        long[] ttlShape = {bsz, ttlDim1, ttlDim2};
        long[] dpShape = {bsz, dpDim1, dpDim2};

        OnnxTensor ttlTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(ttlFlat), ttlShape);
        OnnxTensor dpTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(dpFlat), dpShape);

        if (verbose) {
            System.out.println("Loaded " + bsz + " voice styles\n");
        }

        return new Style(ttlTensor, dpTensor);
    }

    /**
     * Load TTS components
     */
    public static TextToSpeech loadTextToSpeech(String onnxDir, boolean useGpu, OrtEnvironment env)
            throws IOException, OrtException {
        if (useGpu) {
            throw new RuntimeException("GPU mode is not supported yet");
        }
        System.out.println("Using CPU for inference\n");

        // Load config
        Config config = loadCfgs(onnxDir);

        // Create session options
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();

        // Load models
        OrtSession dpSession = env.createSession(onnxDir + "/duration_predictor.onnx", opts);
        OrtSession textEncSession = env.createSession(onnxDir + "/text_encoder.onnx", opts);
        OrtSession vectorEstSession = env.createSession(onnxDir + "/vector_estimator.onnx", opts);
        OrtSession vocoderSession = env.createSession(onnxDir + "/vocoder.onnx", opts);

        // Load text processor
        UnicodeProcessor textProcessor = new UnicodeProcessor(onnxDir + "/unicode_indexer.json");

        return new TextToSpeech(config, textProcessor, dpSession, textEncSession, vectorEstSession, vocoderSession);
    }

    /**
     * Load configuration from JSON
     */
    public static Config loadCfgs(String onnxDir) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(new File(onnxDir + "/tts.json"));

        Config config = new Config();
        config.ae = new Config.AEConfig();
        config.ae.sampleRate = root.get("ae").get("sample_rate").asInt();
        config.ae.baseChunkSize = root.get("ae").get("base_chunk_size").asInt();

        config.ttl = new Config.TTLConfig();
        config.ttl.chunkCompressFactor = root.get("ttl").get("chunk_compress_factor").asInt();
        config.ttl.latentDim = root.get("ttl").get("latent_dim").asInt();

        return config;
    }

    /**
     * Get latent mask from wav lengths
     */
    public static float[][][] getLatentMask(long[] wavLengths, Config config) {
        long baseChunkSize = config.ae.baseChunkSize;
        long chunkCompressFactor = config.ttl.chunkCompressFactor;
        long latentSize = baseChunkSize * chunkCompressFactor;

        long[] latentLengths = new long[wavLengths.length];
        long maxLen = 0;
        for (int i = 0; i < wavLengths.length; i++) {
            latentLengths[i] = (wavLengths[i] + latentSize - 1) / latentSize;
            maxLen = Math.max(maxLen, latentLengths[i]);
        }

        float[][][] mask = new float[wavLengths.length][1][(int) maxLen];
        for (int i = 0; i < wavLengths.length; i++) {
            for (int j = 0; j < maxLen; j++) {
                mask[i][0][j] = j < latentLengths[i] ? 1.0f : 0.0f;
            }
        }
        return mask;
    }

    /**
     * Write a WAV file
     */
    public static void writeWavFile(String filename, float[] audioData, int sampleRate) throws IOException {
        // Convert float samples to 16-bit little-endian PCM
        byte[] bytes = new byte[audioData.length * 2];
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        buffer.order(ByteOrder.LITTLE_ENDIAN);

        for (float sample : audioData) {
            short val = (short) Math.max(-32768, Math.min(32767, sample * 32767));
            buffer.putShort(val);
        }

        ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
        AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
        AudioInputStream ais = new AudioInputStream(bais, format, audioData.length);
        AudioSystem.write(ais, AudioFileFormat.Type.WAVE, new File(filename));
    }

    /**
     * Sanitize a filename (supports Unicode characters)
     */
    public static String sanitizeFilename(String text, int maxLen) {
        // Take the first maxLen code points (not chars, to handle surrogate pairs)
        int[] codePoints = text.codePoints().limit(maxLen).toArray();
        StringBuilder result = new StringBuilder();
        for (int codePoint : codePoints) {
            if (Character.isLetterOrDigit(codePoint)) {
                result.appendCodePoint(codePoint);
            } else {
                result.append('_');
            }
        }
        return result.toString();
    }

    /**
     * Timer utility
     */
    public static <T> T timer(String name, java.util.function.Supplier<T> fn) {
        long start = System.currentTimeMillis();
        System.out.println(name + "...");
        T result = fn.get();
        long elapsed = System.currentTimeMillis() - start;
        System.out.printf(" -> %s completed in %.2f sec\n", name, elapsed / 1000.0);
        return result;
    }

    /**
     * Create a float tensor from a 3D array
     */
    public static OnnxTensor createFloatTensor(float[][][] array, OrtEnvironment env) throws OrtException {
        int dim0 = array.length;
        int dim1 = array[0].length;
        int dim2 = array[0][0].length;

        float[] flat = new float[dim0 * dim1 * dim2];
        int idx = 0;
        for (int i = 0; i < dim0; i++) {
            for (int j = 0; j < dim1; j++) {
                for (int k = 0; k < dim2; k++) {
                    flat[idx++] = array[i][j][k];
                }
            }
        }

        long[] shape = {dim0, dim1, dim2};
        return OnnxTensor.createTensor(env, FloatBuffer.wrap(flat), shape);
    }

    /**
     * Create a long tensor from a 2D array
     */
    public static OnnxTensor createLongTensor(long[][] array, OrtEnvironment env) throws OrtException {
        int dim0 = array.length;
        int dim1 = array[0].length;

        long[] flat = new long[dim0 * dim1];
        int idx = 0;
        for (int i = 0; i < dim0; i++) {
            for (int j = 0; j < dim1; j++) {
                flat[idx++] = array[i][j];
            }
        }

        long[] shape = {dim0, dim1};
        return OnnxTensor.createTensor(env, LongBuffer.wrap(flat), shape);
    }

    /**
     * Load a JSON long array
     */
    public static long[] loadJsonLongArray(String filePath) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(new File(filePath));

        long[] result = new long[root.size()];
        for (int i = 0; i < root.size(); i++) {
            result[i] = root.get(i).asLong();
        }
        return result;
    }
}

130
java/README.md
Normal file
@@ -0,0 +1,130 @@
# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using `ExampleONNX.java`.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) are now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic).

**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.

**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).

**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.
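As a stand-alone sketch of what the `--speed` option amounts to (illustrative class and method names; in the inference code, each predicted duration is divided by the speed factor, so values above 1.0 shorten the audio):

```java
public class SpeedScale {
    // Divide each predicted duration (in seconds) by the speed multiplier;
    // speed > 1.0 produces faster, shorter speech.
    static float[] applySpeed(float[] durations, float speed) {
        float[] out = new float[durations.length];
        for (int i = 0; i < durations.length; i++) {
            out[i] = durations[i] / speed;
        }
        return out;
    }
}
```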

## Installation

This project uses [Maven](https://maven.apache.org/) for dependency management.

### Prerequisites

- Java 11 or higher
- Maven 3.6 or higher

### Install dependencies

```bash
mvn clean install
```

## Basic Usage

### Example 1: Default Inference

Run inference with default settings:

```bash
mvn exec:java
```

This will use:

- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4

### Example 2: Batch Inference

Process multiple voice styles and texts at once:

```bash
mvn exec:java -Dexec.args="--batch --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json --text 'The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요.' --lang en,ko"
```

This will:

- Generate speech for 2 different voice-text-language pairs
- Use the male voice (M1.json) for the first text, in English
- Use the female voice (F1.json) for the second text, in Korean
- Process both samples in a single batch

### Example 3: High-Quality Inference

Increase the number of denoising steps for better quality:

```bash
mvn exec:java -Dexec.args="--total-step 10 --voice-style assets/voice_styles/M1.json --text 'Increasing the number of denoising steps improves the output fidelity and overall quality.'"
```

This will:

- Use 10 denoising steps instead of the default 5
- Produce higher-quality output at the cost of slower inference

### Example 4: Long-Form Inference

The system automatically chunks long texts into manageable segments, synthesizes each segment separately, and concatenates them with natural pauses (0.3 seconds by default) into a single audio file. This happens by default when you don't use the `--batch` flag:

```bash
mvn exec:java -Dexec.args="--voice-style assets/voice_styles/M1.json --text 'This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues.'"
```

This will:

- Automatically split the text into chunks based on paragraph and sentence boundaries
- Synthesize each chunk separately
- Add 0.3 seconds of silence between chunks for natural pauses
- Concatenate all chunks into a single audio file

**Note**: Automatic text chunking is disabled when using `--batch` mode. In batch mode, each text is processed as-is without chunking.
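The concatenation step can be sketched as a simplified stand-alone helper (illustrative names; it mirrors the chunk-joining logic in `TextToSpeech.call`, which inserts the configured silence duration as zeros between synthesized chunks):

```java
import java.util.List;

public class ChunkConcat {
    // Join per-chunk waveforms into one array, inserting silence (zeros) between chunks.
    static float[] concatWithSilence(List<float[]> chunks, int sampleRate, float silenceDuration) {
        int silenceLen = (int) (silenceDuration * sampleRate);
        int total = 0;
        for (float[] c : chunks) total += c.length;
        if (chunks.size() > 1) total += silenceLen * (chunks.size() - 1);
        float[] out = new float[total]; // Java arrays are zero-initialized, so gaps stay silent
        int offset = 0;
        for (int i = 0; i < chunks.size(); i++) {
            if (i > 0) offset += silenceLen; // skip over the silence gap
            System.arraycopy(chunks.get(i), 0, out, offset, chunks.get(i).length);
            offset += chunks.get(i).length;
        }
        return out;
    }
}
```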

**Tip**: If your text contains apostrophes, use escaping or run the JAR directly:

```bash
java -jar target/tts-example.jar --total-step 10 --text "Text with apostrophe's here"
```

## Building a Fat JAR

To create a standalone JAR with all dependencies:

```bash
mvn clean package
```

Then run it directly:

```bash
java -jar target/tts-example.jar
```

Or with arguments:

```bash
java -jar target/tts-example.jar --total-step 10 --text "Your custom text here"
```

## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
| `--lang` | str+ | `en` | Language(s) for synthesis, comma-separated (en, ko, es, pt, fr) |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (multiple text-style pairs, disables automatic chunking) |

## Notes

- **Multilingual Support**: Use `--lang` to specify the language for each text. Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
- **Quality vs. Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
- **Voice Styles**: Uses pre-extracted voice style JSON files for fast inference
||||
110
java/pom.xml
Normal file
@@ -0,0 +1,110 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>ai.supertonic</groupId>
    <artifactId>tts-onnx-java</artifactId>
    <version>1.0.0</version>
    <packaging>jar</packaging>

    <name>TTS ONNX Java Example</name>
    <description>Text-to-Speech inference using ONNX Runtime in Java</description>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <onnxruntime.version>1.23.1</onnxruntime.version>
        <jackson.version>2.15.2</jackson.version>
    </properties>

    <dependencies>
        <!-- ONNX Runtime -->
        <dependency>
            <groupId>com.microsoft.onnxruntime</groupId>
            <artifactId>onnxruntime</artifactId>
            <version>${onnxruntime.version}</version>
        </dependency>

        <!-- Jackson for JSON parsing -->
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>${jackson.version}</version>
        </dependency>

        <!-- JTransforms for fast FFT -->
        <dependency>
            <groupId>com.github.wendykierp</groupId>
            <artifactId>JTransforms</artifactId>
            <version>3.1</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>.</sourceDirectory>
        <plugins>
            <!-- Maven Compiler Plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.11.0</version>
                <configuration>
                    <source>11</source>
                    <target>11</target>
                </configuration>
            </plugin>

            <!-- Maven Exec Plugin for running the example -->
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <mainClass>ExampleONNX</mainClass>
                </configuration>
            </plugin>

            <!-- Maven Jar Plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>3.3.0</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>ExampleONNX</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>

            <!-- Maven Shade Plugin for creating a fat JAR -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.5.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>ExampleONNX</mainClass>
                                </transformer>
                            </transformers>
                            <finalName>tts-example</finalName>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
140
nodejs/README.md
Normal file
@@ -0,0 +1,140 @@
# TTS ONNX Node.js Implementation

Node.js implementation for TTS inference. Uses ONNX Runtime to generate speech from text.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic).

**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.

**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).

**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.

## Requirements

- Node.js v16 or higher
- npm or yarn

## Installation

```bash
cd nodejs
npm install
```

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
npm start
```

Or:
```bash
node example_onnx.js
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4

### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
node example_onnx.js \
  --voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
  --lang "en,ko" \
  --batch
```

This will:
- Use the `--batch` flag to enable batch processing mode
- Generate speech for 2 different voice-text pairs
- Use the male voice style (M1.json) for the first, English text
- Use the female voice style (F1.json) for the second, Korean text
- Process both samples in a single batch (automatic text chunking disabled)
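The comma/pipe pairing is positional: the i-th style is matched with the i-th text and language. A minimal sketch of that pairing logic (the `toJobs` helper below is illustrative, not part of the repository; `example_onnx.js` does the equivalent splits in `parseArgs`):

```javascript
// Split comma-separated styles, pipe-separated texts, and
// comma-separated languages, then zip them into per-sample jobs.
function toJobs(voiceStyleArg, textArg, langArg) {
  const styles = voiceStyleArg.split(',');
  const texts = textArg.split('|');
  const langs = langArg.split(',');
  if (styles.length !== texts.length || texts.length !== langs.length) {
    throw new Error('voice-style, text, and lang counts must match in batch mode');
  }
  return styles.map((style, i) => ({ style, text: texts[i], lang: langs[i] }));
}

const jobs = toJobs('M1.json,F1.json', 'Hello.|안녕하세요.', 'en,ko');
// jobs[0] → { style: 'M1.json', text: 'Hello.', lang: 'en' }
```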

### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
node example_onnx.js \
  --total-step 10 \
  --voice-style "assets/voice_styles/M1.json" \
  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference

### Example 4: Long-Form Inference
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
```bash
node example_onnx.js \
  --voice-style "assets/voice_styles/M1.json" \
  --text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
```

This will:
- Automatically split the long text into smaller chunks (max 300 characters by default)
- Process each chunk separately while maintaining natural speech flow
- Insert brief silences (0.3 seconds) between chunks for natural pacing
- Combine all chunks into a single output audio file

**Note**: When using batch mode (`--batch`), automatic text chunking is disabled. Use non-batch mode for long-form text synthesis.
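The splitting itself is done by a `chunkText` helper in `helper.js` whose definition falls outside this excerpt. Its described behavior, greedy sentence packing under a length limit, can be sketched as follows (`chunkTextSketch` is a hypothetical name; the real implementation may differ in detail):

```javascript
// Greedy sentence packing: split on sentence-ending punctuation,
// then pack sentences into chunks no longer than maxLen characters.
// A sketch of the documented behavior, not the repository's chunkText.
function chunkTextSketch(text, maxLen = 300) {
  const sentences = text.match(/[^.!?]+[.!?]*/g) || [text];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    const s = sentence.trim();
    if (current && current.length + s.length + 1 > maxLen) {
      chunks.push(current); // current chunk is full; start a new one
      current = s;
    } else {
      current = current ? `${current} ${s}` : s;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk is then synthesized independently and joined with 0.3 s of silence, which is why sentence boundaries (rather than arbitrary character offsets) matter for natural-sounding output.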

## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s). Separate multiple files with commas |
| `--text` | str+ | (long default text) | Text(s) to synthesize. Separate multiple texts with pipes |
| `--lang` | str+ | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr`. Separate multiple with commas |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |

## Notes

- **Batch Processing**: The number of voice style files must match the number of texts. Use commas to separate files and pipes to separate texts
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Long-Form Inference**: Without the `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet

## Architecture

- `helper.js`: Node.js port of Python's `helper.py`
  - `Preprocessor`: Audio preprocessing (STFT, mel spectrogram)
  - `UnicodeProcessor`: Text preprocessing
  - Utility functions (mask generation, tensor conversion, etc.)

- `example_onnx.js`: Main inference script
  - ONNX model loading
  - TTS inference pipeline execution
  - WAV file saving

- `package.json`: Node.js project configuration and dependencies

## Implementation Notes

1. **Pure Node.js WAV Processing**: Writes WAV files without external native libraries. Outputs 16-bit PCM format.
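The core of that WAV path is a clamp-and-scale quantization from float samples to signed 16-bit integers, which `writeWavFile` in `helper.js` performs per sample while serializing the RIFF header. Shown in isolation (the `floatTo16BitPcm` helper name is illustrative):

```javascript
// Clamp each float sample to [-1, 1], then scale to the signed
// 16-bit range -- the same per-sample conversion writeWavFile uses.
function floatTo16BitPcm(samples) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = Math.floor(s * 32767);
  }
  return out;
}
```

Clamping first means any out-of-range vocoder output clips cleanly instead of wrapping around in the 16-bit integer.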

2. **Memory Efficiency**: Node.js may consume significant memory when processing large arrays; latents and waveforms are held as plain JavaScript arrays.

3. **Performance**: The mel spectrogram extraction (Step 1-1) is currently slower than Python's Librosa, which uses highly optimized C extensions. This bottleneck could be improved with WASM-based FFT libraries or native addons.

119
nodejs/example_onnx.js
Normal file
@@ -0,0 +1,119 @@
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';

import { loadTextToSpeech, loadVoiceStyle, timer, writeWavFile, sanitizeFilename } from './helper.js';

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

/**
 * Parse command line arguments
 */
function parseArgs() {
  const args = {
    useGpu: false,
    onnxDir: 'assets/onnx',
    totalStep: 5,
    speed: 1.05,
    nTest: 4,
    voiceStyle: ['assets/voice_styles/M1.json'],
    text: ['This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.'],
    lang: ['en'],
    saveDir: 'results',
    batch: false
  };

  for (let i = 2; i < process.argv.length; i++) {
    const arg = process.argv[i];
    if (arg === '--use-gpu') {
      args.useGpu = true;
    } else if (arg === '--batch') {
      args.batch = true;
    } else if (arg === '--onnx-dir' && i + 1 < process.argv.length) {
      args.onnxDir = process.argv[++i];
    } else if (arg === '--total-step' && i + 1 < process.argv.length) {
      args.totalStep = parseInt(process.argv[++i]);
    } else if (arg === '--speed' && i + 1 < process.argv.length) {
      args.speed = parseFloat(process.argv[++i]);
    } else if (arg === '--n-test' && i + 1 < process.argv.length) {
      args.nTest = parseInt(process.argv[++i]);
    } else if (arg === '--voice-style' && i + 1 < process.argv.length) {
      args.voiceStyle = process.argv[++i].split(',');
    } else if (arg === '--text' && i + 1 < process.argv.length) {
      args.text = process.argv[++i].split('|');
    } else if (arg === '--lang' && i + 1 < process.argv.length) {
      args.lang = process.argv[++i].split(',');
    } else if (arg === '--save-dir' && i + 1 < process.argv.length) {
      args.saveDir = process.argv[++i];
    }
  }

  return args;
}

/**
 * Main inference function
 */
async function main() {
  console.log('=== TTS Inference with ONNX Runtime (Node.js) ===\n');

  // --- 1. Parse arguments --- //
  const args = parseArgs();
  const totalStep = args.totalStep;
  const speed = args.speed;
  const nTest = args.nTest;
  const saveDir = args.saveDir;
  const voiceStylePaths = args.voiceStyle.map(p => path.resolve(__dirname, p));
  const textList = args.text;
  const langList = args.lang;
  const batch = args.batch;

  if (voiceStylePaths.length !== textList.length) {
    throw new Error(`Number of voice styles (${voiceStylePaths.length}) must match number of texts (${textList.length})`);
  }
  const bsz = voiceStylePaths.length;

  // --- 2. Load Text to Speech --- //
  const onnxDir = path.resolve(__dirname, args.onnxDir);
  const textToSpeech = await loadTextToSpeech(onnxDir, args.useGpu);

  // --- 3. Load Voice Style --- //
  const style = loadVoiceStyle(voiceStylePaths, true);

  // --- 4. Synthesize speech --- //
  for (let n = 0; n < nTest; n++) {
    console.log(`\n[${n + 1}/${nTest}] Starting synthesis...`);

    const { wav, duration } = await timer('Generating speech from text', async () => {
      if (batch) {
        return await textToSpeech.batch(textList, langList, style, totalStep, speed);
      } else {
        return await textToSpeech.call(textList[0], langList[0], style, totalStep, speed);
      }
    });

    if (!fs.existsSync(saveDir)) {
      fs.mkdirSync(saveDir, { recursive: true });
    }

    const wavShape = [bsz, wav.length / bsz];
    for (let b = 0; b < bsz; b++) {
      const fname = `${sanitizeFilename(textList[b], 20)}_${n + 1}.wav`;
      const wavLen = Math.floor(textToSpeech.sampleRate * duration[b]);
      const wavOut = wav.slice(b * wavShape[1], b * wavShape[1] + wavLen);

      const outputPath = path.join(saveDir, fname);
      writeWavFile(outputPath, wavOut, textToSpeech.sampleRate);
      console.log(`Saved: ${outputPath}`);
    }
  }

  console.log('\n=== Synthesis completed successfully! ===');
}

// Run main function
main().catch(err => {
  console.error('Error during inference:', err);
  process.exit(1);
});
559
nodejs/helper.js
Normal file
@@ -0,0 +1,559 @@
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
import * as ort from 'onnxruntime-node';

const __filename = fileURLToPath(import.meta.url);

const AVAILABLE_LANGS = ["en", "ko", "es", "pt", "fr"];

/**
 * Unicode text processor
 */
class UnicodeProcessor {
  constructor(unicodeIndexerJsonPath) {
    this.indexer = JSON.parse(fs.readFileSync(unicodeIndexerJsonPath, 'utf8'));
  }

  _preprocessText(text, lang) {
    // TODO: Need advanced normalizer for better performance
    text = text.normalize('NFKD');

    // Remove emojis (wide Unicode range)
    const emojiPattern = /[\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F680}-\u{1F6FF}\u{1F700}-\u{1F77F}\u{1F780}-\u{1F7FF}\u{1F800}-\u{1F8FF}\u{1F900}-\u{1F9FF}\u{1FA00}-\u{1FA6F}\u{1FA70}-\u{1FAFF}\u{2600}-\u{26FF}\u{2700}-\u{27BF}\u{1F1E6}-\u{1F1FF}]+/gu;
    text = text.replace(emojiPattern, '');

    // Replace various dashes and symbols
    const replacements = {
      '–': '-',
      '‑': '-',
      '—': '-',
      '_': ' ',
      '\u201C': '"', // left double quote
      '\u201D': '"', // right double quote
      '\u2018': "'", // left single quote
      '\u2019': "'", // right single quote
      '´': "'",
      '`': "'",
      '[': ' ',
      ']': ' ',
      '|': ' ',
      '/': ' ',
      '#': ' ',
      '→': ' ',
      '←': ' ',
    };
    for (const [k, v] of Object.entries(replacements)) {
      text = text.replaceAll(k, v);
    }

    // Remove special symbols
    text = text.replace(/[♥☆♡©\\]/g, '');

    // Replace known expressions
    const exprReplacements = {
      '@': ' at ',
      'e.g.,': 'for example, ',
      'i.e.,': 'that is, ',
    };
    for (const [k, v] of Object.entries(exprReplacements)) {
      text = text.replaceAll(k, v);
    }

    // Fix spacing around punctuation
    text = text.replace(/ ,/g, ',');
    text = text.replace(/ \./g, '.');
    text = text.replace(/ !/g, '!');
    text = text.replace(/ \?/g, '?');
    text = text.replace(/ ;/g, ';');
    text = text.replace(/ :/g, ':');
    text = text.replace(/ '/g, "'");

    // Remove duplicate quotes
    while (text.includes('""')) {
      text = text.replace('""', '"');
    }
    while (text.includes("''")) {
      text = text.replace("''", "'");
    }
    while (text.includes('``')) {
      text = text.replace('``', '`');
    }

    // Remove extra spaces
    text = text.replace(/\s+/g, ' ').trim();

    // If text doesn't end with punctuation, quotes, or closing brackets, add a period
    if (!/[.!?;:,'\"')\]}…。」』】〉》›»]$/.test(text)) {
      text += '.';
    }

    // Validate language
    if (!AVAILABLE_LANGS.includes(lang)) {
      throw new Error(`Invalid language: ${lang}. Available: ${AVAILABLE_LANGS.join(', ')}`);
    }

    // Wrap text with language tags
    text = `<${lang}>` + text + `</${lang}>`;

    return text;
  }

  _textToUnicodeValues(text) {
    return Array.from(text).map(char => char.charCodeAt(0));
  }

  _getTextMask(textIdsLengths) {
    return lengthToMask(textIdsLengths);
  }

  call(textList, langList) {
    const processedTexts = textList.map((t, i) => this._preprocessText(t, langList[i]));
    const textIdsLengths = processedTexts.map(t => t.length);
    const maxLen = Math.max(...textIdsLengths);

    const textIds = [];
    for (let i = 0; i < processedTexts.length; i++) {
      const row = new Array(maxLen).fill(0);
      const unicodeVals = this._textToUnicodeValues(processedTexts[i]);
      for (let j = 0; j < unicodeVals.length; j++) {
        row[j] = this.indexer[unicodeVals[j]];
      }
      textIds.push(row);
    }

    const textMask = this._getTextMask(textIdsLengths);
    return { textIds, textMask };
  }
}

/**
 * Style class
 */
class Style {
  constructor(styleTtlOnnx, styleDpOnnx) {
    this.ttl = styleTtlOnnx;
    this.dp = styleDpOnnx;
  }
}

/**
 * TextToSpeech class
 */
class TextToSpeech {
  constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
    this.cfgs = cfgs;
    this.textProcessor = textProcessor;
    this.dpOrt = dpOrt;
    this.textEncOrt = textEncOrt;
    this.vectorEstOrt = vectorEstOrt;
    this.vocoderOrt = vocoderOrt;
    this.sampleRate = cfgs.ae.sample_rate;
    this.baseChunkSize = cfgs.ae.base_chunk_size;
    this.chunkCompressFactor = cfgs.ttl.chunk_compress_factor;
    this.ldim = cfgs.ttl.latent_dim;
  }

  sampleNoisyLatent(duration) {
    const wavLenMax = Math.max(...duration) * this.sampleRate;
    const wavLengths = duration.map(d => Math.floor(d * this.sampleRate));
    const chunkSize = this.baseChunkSize * this.chunkCompressFactor;
    const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
    const latentDim = this.ldim * this.chunkCompressFactor;

    // Generate random noise
    const noisyLatent = [];
    for (let b = 0; b < duration.length; b++) {
      const batch = [];
      for (let d = 0; d < latentDim; d++) {
        const row = [];
        for (let t = 0; t < latentLen; t++) {
          // Box-Muller transform for normal distribution
          // Add epsilon to avoid log(0)
          const eps = 1e-10;
          const u1 = Math.max(eps, Math.random());
          const u2 = Math.random();
          const randNormal = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
          row.push(randNormal);
        }
        batch.push(row);
      }
      noisyLatent.push(batch);
    }

    const latentMask = getLatentMask(wavLengths, this.baseChunkSize, this.chunkCompressFactor);

    // Apply mask
    for (let b = 0; b < noisyLatent.length; b++) {
      for (let d = 0; d < noisyLatent[b].length; d++) {
        for (let t = 0; t < noisyLatent[b][d].length; t++) {
          noisyLatent[b][d][t] *= latentMask[b][0][t];
        }
      }
    }

    return { noisyLatent, latentMask };
  }

  async _infer(textList, langList, style, totalStep, speed = 1.05) {
    if (textList.length !== style.ttl.dims[0]) {
      throw new Error('Number of texts must match number of style vectors');
    }
    const bsz = textList.length;
    const { textIds, textMask } = this.textProcessor.call(textList, langList);
    const textIdsShape = [bsz, textIds[0].length];
    const textMaskShape = [bsz, 1, textMask[0][0].length];

    const textMaskTensor = arrayToTensor(textMask, textMaskShape);

    const dpResult = await this.dpOrt.run({
      text_ids: intArrayToTensor(textIds, textIdsShape),
      style_dp: style.dp,
      text_mask: textMaskTensor
    });

    const durOnnx = Array.from(dpResult.duration.data);

    // Apply speed factor to duration
    for (let i = 0; i < durOnnx.length; i++) {
      durOnnx[i] /= speed;
    }

    const textEncResult = await this.textEncOrt.run({
      text_ids: intArrayToTensor(textIds, textIdsShape),
      style_ttl: style.ttl,
      text_mask: textMaskTensor
    });

    const textEmbTensor = textEncResult.text_emb;

    let { noisyLatent, latentMask } = this.sampleNoisyLatent(durOnnx);
    const latentShape = [bsz, noisyLatent[0].length, noisyLatent[0][0].length];
    const latentMaskShape = [bsz, 1, latentMask[0][0].length];

    const latentMaskTensor = arrayToTensor(latentMask, latentMaskShape);

    const totalStepArray = new Array(bsz).fill(totalStep);
    const scalarShape = [bsz];
    const totalStepTensor = arrayToTensor(totalStepArray, scalarShape);

    for (let step = 0; step < totalStep; step++) {
      const currentStepArray = new Array(bsz).fill(step);

      const vectorEstResult = await this.vectorEstOrt.run({
        noisy_latent: arrayToTensor(noisyLatent, latentShape),
        text_emb: textEmbTensor,
        style_ttl: style.ttl,
        text_mask: textMaskTensor,
        latent_mask: latentMaskTensor,
        total_step: totalStepTensor,
        current_step: arrayToTensor(currentStepArray, scalarShape)
      });

      const denoisedLatent = Array.from(vectorEstResult.denoised_latent.data);

      // Update latent with the denoised output
      let idx = 0;
      for (let b = 0; b < noisyLatent.length; b++) {
        for (let d = 0; d < noisyLatent[b].length; d++) {
          for (let t = 0; t < noisyLatent[b][d].length; t++) {
            noisyLatent[b][d][t] = denoisedLatent[idx++];
          }
        }
      }
    }

    const vocoderResult = await this.vocoderOrt.run({
      latent: arrayToTensor(noisyLatent, latentShape)
    });

    const wav = Array.from(vocoderResult.wav_tts.data);
    return { wav, duration: durOnnx };
  }

  async call(text, lang, style, totalStep, speed = 1.05, silenceDuration = 0.3) {
    if (style.ttl.dims[0] !== 1) {
      throw new Error('Single speaker text to speech only supports single style');
    }
    const maxLen = lang === 'ko' ? 120 : 300;
    const textList = chunkText(text, maxLen);
    let wavCat = null;
    let durCat = 0;

    for (const chunk of textList) {
      const { wav, duration } = await this._infer([chunk], [lang], style, totalStep, speed);

      if (wavCat === null) {
        wavCat = wav;
        durCat = duration[0];
      } else {
        const silenceLen = Math.floor(silenceDuration * this.sampleRate);
        const silence = new Array(silenceLen).fill(0);
        wavCat = [...wavCat, ...silence, ...wav];
        durCat += duration[0] + silenceDuration;
      }
    }

    return { wav: wavCat, duration: [durCat] };
  }

  async batch(textList, langList, style, totalStep, speed = 1.05) {
    return await this._infer(textList, langList, style, totalStep, speed);
  }
}

/**
 * Convert lengths to binary mask
 */
function lengthToMask(lengths, maxLen = null) {
  maxLen = maxLen || Math.max(...lengths);
  const mask = [];
  for (let i = 0; i < lengths.length; i++) {
    const row = [];
    for (let j = 0; j < maxLen; j++) {
      row.push(j < lengths[i] ? 1.0 : 0.0);
    }
    mask.push([row]); // [B, 1, maxLen]
  }
  return mask;
}

/**
 * Get latent mask from wav lengths
 */
function getLatentMask(wavLengths, baseChunkSize, chunkCompressFactor) {
  const latentSize = baseChunkSize * chunkCompressFactor;
  const latentLengths = wavLengths.map(len =>
    Math.floor((len + latentSize - 1) / latentSize)
  );
  return lengthToMask(latentLengths);
}

/**
 * Load ONNX model
 */
async function loadOnnx(onnxPath, opts) {
  return await ort.InferenceSession.create(onnxPath, opts);
}

/**
 * Load all ONNX models for TTS
 */
async function loadOnnxAll(onnxDir, opts) {
  const dpPath = path.join(onnxDir, 'duration_predictor.onnx');
  const textEncPath = path.join(onnxDir, 'text_encoder.onnx');
  const vectorEstPath = path.join(onnxDir, 'vector_estimator.onnx');
  const vocoderPath = path.join(onnxDir, 'vocoder.onnx');

  const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = await Promise.all([
    loadOnnx(dpPath, opts),
    loadOnnx(textEncPath, opts),
    loadOnnx(vectorEstPath, opts),
    loadOnnx(vocoderPath, opts)
  ]);

  return { dpOrt, textEncOrt, vectorEstOrt, vocoderOrt };
}

/**
 * Load configuration
 */
function loadCfgs(onnxDir) {
  const cfgPath = path.join(onnxDir, 'tts.json');
  const cfgs = JSON.parse(fs.readFileSync(cfgPath, 'utf8'));
  return cfgs;
}

/**
 * Load text processor
 */
function loadTextProcessor(onnxDir) {
  const unicodeIndexerPath = path.join(onnxDir, 'unicode_indexer.json');
  const textProcessor = new UnicodeProcessor(unicodeIndexerPath);
  return textProcessor;
}

/**
 * Load voice style from JSON file
 */
export function loadVoiceStyle(voiceStylePaths, verbose = false) {
  const bsz = voiceStylePaths.length;

  // Read first file to get dimensions
  const firstStyle = JSON.parse(fs.readFileSync(voiceStylePaths[0], 'utf8'));
  const ttlDims = firstStyle.style_ttl.dims;
  const dpDims = firstStyle.style_dp.dims;

  const ttlDim1 = ttlDims[1];
  const ttlDim2 = ttlDims[2];
  const dpDim1 = dpDims[1];
  const dpDim2 = dpDims[2];

  // Pre-allocate arrays with full batch size
  const ttlSize = bsz * ttlDim1 * ttlDim2;
  const dpSize = bsz * dpDim1 * dpDim2;
  const ttlFlat = new Float32Array(ttlSize);
  const dpFlat = new Float32Array(dpSize);

  // Fill in the data
  for (let i = 0; i < bsz; i++) {
    const voiceStyle = JSON.parse(fs.readFileSync(voiceStylePaths[i], 'utf8'));

    const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
    const ttlOffset = i * ttlDim1 * ttlDim2;
    ttlFlat.set(ttlData, ttlOffset);

    const dpData = voiceStyle.style_dp.data.flat(Infinity);
    const dpOffset = i * dpDim1 * dpDim2;
    dpFlat.set(dpData, dpOffset);
  }

  const ttlStyle = new ort.Tensor('float32', ttlFlat, [bsz, ttlDim1, ttlDim2]);
  const dpStyle = new ort.Tensor('float32', dpFlat, [bsz, dpDim1, dpDim2]);

  if (verbose) {
    console.log(`Loaded ${bsz} voice styles`);
  }

  return new Style(ttlStyle, dpStyle);
}

/**
 * Load text to speech components
 */
export async function loadTextToSpeech(onnxDir, useGpu = false) {
  const opts = {};
  if (useGpu) {
    throw new Error('GPU mode is not supported yet');
  } else {
    console.log('Using CPU for inference');
  }

  const cfgs = loadCfgs(onnxDir);
  const { dpOrt, textEncOrt, vectorEstOrt, vocoderOrt } = await loadOnnxAll(onnxDir, opts);
  const textProcessor = loadTextProcessor(onnxDir);
  const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);

  return textToSpeech;
}

/**
 * Convert 3D array to ONNX tensor
 */
function arrayToTensor(array, dims) {
  // Flatten the array
  const flat = array.flat(Infinity);
  return new ort.Tensor('float32', Float32Array.from(flat), dims);
}

/**
 * Convert 2D int array to ONNX tensor
 */
function intArrayToTensor(array, dims) {
  const flat = array.flat(Infinity);
  return new ort.Tensor('int64', BigInt64Array.from(flat.map(x => BigInt(x))), dims);
}

/**
 * Write WAV file
 */
export function writeWavFile(filename, audioData, sampleRate) {
  const numChannels = 1;
  const bitsPerSample = 16;
  const byteRate = sampleRate * numChannels * bitsPerSample / 8;
  const blockAlign = numChannels * bitsPerSample / 8;
  const dataSize = audioData.length * bitsPerSample / 8;

  const buffer = Buffer.alloc(44 + dataSize);

  // RIFF header
  buffer.write('RIFF', 0);
  buffer.writeUInt32LE(36 + dataSize, 4);
  buffer.write('WAVE', 8);

  // fmt chunk
  buffer.write('fmt ', 12);
  buffer.writeUInt32LE(16, 16); // fmt chunk size
  buffer.writeUInt16LE(1, 20); // audio format (PCM)
  buffer.writeUInt16LE(numChannels, 22);
  buffer.writeUInt32LE(sampleRate, 24);
  buffer.writeUInt32LE(byteRate, 28);
  buffer.writeUInt16LE(blockAlign, 32);
  buffer.writeUInt16LE(bitsPerSample, 34);

  // data chunk
  buffer.write('data', 36);
  buffer.writeUInt32LE(dataSize, 40);

  // Write audio data
  for (let i = 0; i < audioData.length; i++) {
    const sample = Math.max(-1, Math.min(1, audioData[i]));
    const intSample = Math.floor(sample * 32767);
    buffer.writeInt16LE(intSample, 44 + i * 2);
  }

  fs.writeFileSync(filename, buffer);
}

/**
 * Timer utility for measuring execution time
|
||||
*/
|
||||
export async function timer(name, fn) {
|
||||
const start = Date.now();
|
||||
console.log(`${name}...`);
|
||||
const result = await fn();
|
||||
const elapsed = ((Date.now() - start) / 1000).toFixed(2);
|
||||
console.log(` -> ${name} completed in ${elapsed} sec`);
|
||||
return result;
|
||||
}
|
||||
|
||||
/**
|
||||
* Sanitize filename by replacing non-alphanumeric characters with underscores (supports Unicode)
|
||||
*/
|
||||
export function sanitizeFilename(text, maxLen) {
|
||||
const prefix = text.substring(0, maxLen);
|
||||
// \p{L} matches any Unicode letter, \p{N} matches any Unicode number
|
||||
return prefix.replace(/[^\p{L}\p{N}_]/gu, '_');
|
||||
}
|
||||
|
||||
/**
|
||||
* Chunk text into manageable segments
|
||||
*/
|
||||
function chunkText(text, maxLen = 300) {
|
||||
if (typeof text !== 'string') {
|
||||
throw new Error(`chunkText expects a string, got ${typeof text}`);
|
||||
}
|
||||
|
||||
// Split by paragraph (two or more newlines)
|
||||
const paragraphs = text.trim().split(/\n\s*\n+/).filter(p => p.trim());
|
||||
|
||||
const chunks = [];
|
||||
|
||||
for (let paragraph of paragraphs) {
|
||||
paragraph = paragraph.trim();
|
||||
if (!paragraph) continue;
|
||||
|
||||
// Split by sentence boundaries (period, question mark, exclamation mark followed by space)
|
||||
// But exclude common abbreviations like Mr., Mrs., Dr., etc. and single capital letters like F.
|
||||
const sentences = paragraph.split(/(?<!Mr\.|Mrs\.|Ms\.|Dr\.|Prof\.|Sr\.|Jr\.|Ph\.D\.|etc\.|e\.g\.|i\.e\.|vs\.|Inc\.|Ltd\.|Co\.|Corp\.|St\.|Ave\.|Blvd\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+/);
|
||||
|
||||
let currentChunk = "";
|
||||
|
||||
for (let sentence of sentences) {
|
||||
if (currentChunk.length + sentence.length + 1 <= maxLen) {
|
||||
currentChunk += (currentChunk ? " " : "") + sentence;
|
||||
} else {
|
||||
if (currentChunk) {
|
||||
chunks.push(currentChunk.trim());
|
||||
}
|
||||
currentChunk = sentence;
|
||||
}
|
||||
}
|
||||
|
||||
if (currentChunk) {
|
||||
chunks.push(currentChunk.trim());
|
||||
}
|
||||
}
|
||||
|
||||
return chunks;
|
||||
}
|
||||
26
nodejs/package.json
Normal file
@@ -0,0 +1,26 @@
{
  "name": "tts-onnx-nodejs",
  "version": "1.0.0",
  "description": "TTS inference using ONNX Runtime for Node.js",
  "main": "example_onnx.js",
  "type": "module",
  "scripts": {
    "start": "node example_onnx.js"
  },
  "keywords": [
    "tts",
    "onnx",
    "speech-synthesis",
    "nodejs"
  ],
  "author": "",
  "license": "MIT",
  "dependencies": {
    "fft.js": "^4.0.3",
    "js-yaml": "^4.1.0",
    "onnxruntime-node": "^1.19.2"
  },
  "engines": {
    "node": ">=16.0.0"
  }
}
145
py/README.md
Normal file
@@ -0,0 +1,145 @@
# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using `example_onnx.py`.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added `supertonic` PyPI package! Install via `pip install supertonic` for a streamlined experience. This is a separate usage method from the ONNX examples in this directory. For more details, visit [supertonic-py documentation](https://supertone-inc.github.io/supertonic-py) and see `example_pypi.py` for usage.

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)

**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.

**2025.11.19** - Added `--speed` parameter to control speech synthesis speed. Adjust the speed factor to make speech faster or slower while maintaining natural quality.

**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.

## Installation

This project uses [uv](https://docs.astral.sh/uv/) for fast package management.

### Install uv (if not already installed)
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### Install dependencies
```bash
uv sync
```

Or if you prefer using traditional pip with requirements.txt:
```bash
pip install -r requirements.txt
```

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
uv run example_onnx.py
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4

### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
uv run example_onnx.py \
    --voice-style assets/voice_styles/M1.json assets/voice_styles/F1.json \
    --text "The sun sets behind the mountains, painting the sky in shades of pink and orange." "오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
    --lang en ko \
    --batch
```

This will:
- Use the `--batch` flag to enable batch processing mode
- Generate speech for 2 different voice-text pairs
- Use the male voice style (M1.json) for the first English text
- Use the female voice style (F1.json) for the second Korean text
- Process both samples in a single batch (automatic text chunking disabled)

### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
uv run example_onnx.py \
    --total-step 10 \
    --voice-style assets/voice_styles/M1.json \
    --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference

### Example 4: Long-Form Inference
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
```bash
uv run example_onnx.py \
    --voice-style assets/voice_styles/M1.json \
    --text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
```

This will:
- Automatically split the long text into smaller chunks (max 300 characters by default)
- Process each chunk separately while maintaining natural speech flow
- Insert brief silences (0.3 seconds) between chunks for natural pacing
- Combine all chunks into a single output audio file

**Note**: When using batch mode (`--batch`), automatic text chunking is disabled. Use non-batch mode for long-form text synthesis.

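The long-form flow described above (split into chunks, synthesize each, then join with 0.3 seconds of silence) can be sketched in a few lines. This is an illustrative sketch, not the repo's API: `join_chunks` and `SAMPLE_RATE` are hypothetical names, and the ones-filled arrays stand in for real synthesized audio.

```python
import numpy as np

SAMPLE_RATE = 44100  # illustrative; the real rate comes from the model config


def join_chunks(chunk_wavs, silence_duration=0.3, sample_rate=SAMPLE_RATE):
    """Concatenate per-chunk waveforms, inserting silence between chunks."""
    silence = np.zeros(int(silence_duration * sample_rate), dtype=np.float32)
    parts = []
    for i, wav in enumerate(chunk_wavs):
        if i > 0:
            parts.append(silence)  # pause between chunks, not before the first
        parts.append(wav)
    return np.concatenate(parts)


# Two fake 1-second "chunks" stand in for synthesized audio
a = np.ones(SAMPLE_RATE, dtype=np.float32)
b = np.ones(SAMPLE_RATE, dtype=np.float32)
out = join_chunks([a, b])  # length = 2 s of audio + 0.3 s of silence
```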
### Example 5: Adjusting Speech Speed
Control the speed of speech synthesis:
```bash
# Faster speech (speed > 1.0)
uv run example_onnx.py \
    --voice-style assets/voice_styles/F2.json \
    --text "This text will be synthesized at a faster pace." \
    --speed 1.2

# Slower speech (speed < 1.0)
uv run example_onnx.py \
    --voice-style assets/voice_styles/M2.json \
    --text "This text will be synthesized at a slower, more deliberate pace." \
    --speed 0.9
```

This will:
- Use `--speed 1.2` to generate faster speech
- Use `--speed 0.9` to generate slower speech
- Default speed is 1.05 if not specified
- Recommended speed range is between 0.9 and 1.5 for natural-sounding results

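As a rough mental model of the speed factor (a sketch, not the repo's API; `apply_speed` is a hypothetical helper): the predicted duration is divided by the speed factor before audio is generated, so higher speed yields proportionally shorter audio.

```python
def apply_speed(duration_sec: float, speed: float) -> float:
    """Output duration scales inversely with the speed factor (dur / speed)."""
    return duration_sec / speed


# A sentence predicted at 10 seconds, played at --speed 1.25, lasts 8 seconds
faster = apply_speed(10.0, 1.25)
```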
## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not yet supported) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) |
| `--text` | str+ | (long default text) | Text(s) to synthesize |
| `--lang` | str+ | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr` |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |

## Notes

- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Long-Form Inference**: Without the `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet

116
py/example_onnx.py
Normal file
@@ -0,0 +1,116 @@
import argparse
import os

import soundfile as sf

from helper import load_text_to_speech, timer, sanitize_filename, load_voice_style


def parse_args():
    parser = argparse.ArgumentParser(description="TTS Inference with ONNX")

    # Device settings
    parser.add_argument(
        "--use-gpu", action="store_true", help="Use GPU for inference (default: CPU)"
    )

    # Model settings
    parser.add_argument(
        "--onnx-dir",
        type=str,
        default="assets/onnx",
        help="Path to ONNX model directory",
    )

    # Synthesis parameters
    parser.add_argument(
        "--total-step", type=int, default=5, help="Number of denoising steps"
    )
    parser.add_argument(
        "--speed",
        type=float,
        default=1.05,
        help="Speech speed (default: 1.05, higher = faster)",
    )
    parser.add_argument(
        "--n-test", type=int, default=4, help="Number of times to generate"
    )

    # Batch processing
    parser.add_argument("--batch", action="store_true", help="Batch processing")

    # Input/Output
    parser.add_argument(
        "--voice-style",
        type=str,
        nargs="+",
        default=["assets/voice_styles/M1.json"],
        help="Voice style file path(s). Can specify multiple files for batch processing",
    )
    parser.add_argument(
        "--text",
        type=str,
        nargs="+",
        default=[
            "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
        ],
        help="Text(s) to synthesize. Can specify multiple texts for batch processing",
    )
    parser.add_argument(
        "--lang",
        type=str,
        nargs="+",
        default=["en"],
        help="Language(s) of the text(s). Can specify multiple languages for batch processing",
    )
    parser.add_argument(
        "--save-dir", type=str, default="results", help="Output directory"
    )

    return parser.parse_args()


print("=== TTS Inference with ONNX Runtime (Python) ===\n")

# --- 1. Parse arguments --- #
args = parse_args()
total_step = args.total_step
speed = args.speed
n_test = args.n_test
save_dir = args.save_dir
voice_style_paths = args.voice_style
text_list = args.text
lang_list = args.lang
batch = args.batch

assert len(voice_style_paths) == len(
    text_list
), f"Number of voice styles ({len(voice_style_paths)}) must match number of texts ({len(text_list)})"
bsz = len(voice_style_paths)

# --- 2. Load Text to Speech --- #
text_to_speech = load_text_to_speech(args.onnx_dir, args.use_gpu)

# --- 3. Load Voice Style --- #
style = load_voice_style(voice_style_paths, verbose=True)

# --- 4. Synthesize Speech --- #
for n in range(n_test):
    print(f"\n[{n+1}/{n_test}] Starting synthesis...")
    with timer("Generating speech from text"):
        if batch:
            wav, duration = text_to_speech.batch(
                text_list, lang_list, style, total_step, speed
            )
        else:
            wav, duration = text_to_speech(
                text_list[0], lang_list[0], style, total_step, speed
            )
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    for b in range(bsz):
        fname = f"{sanitize_filename(text_list[b], 20)}_{n+1}.wav"
        w = wav[b, : int(text_to_speech.sample_rate * duration[b].item())]  # [T_trim]
        sf.write(os.path.join(save_dir, fname), w, text_to_speech.sample_rate)
        print(f"Saved: {save_dir}/{fname}")
print("\n=== Synthesis completed successfully! ===")
16
py/example_pypi.py
Normal file
@@ -0,0 +1,16 @@
from supertonic import TTS

# Note: First run downloads model automatically (~260MB)
tts = TTS(auto_download=True)

# Get a voice style
style = tts.get_voice_style(voice_name="M4")

# Generate speech
text = "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
wav, duration = tts.synthesize(text, voice_style=style)
# wav: np.ndarray, shape = (1, num_samples)
# duration: np.ndarray, shape = (1,)

# Save to file
tts.save_audio(wav, "results/example_pypi.wav")
429
py/helper.py
Normal file
@@ -0,0 +1,429 @@
import json
import os
import re
import time
from contextlib import contextmanager
from typing import Optional
from unicodedata import normalize

import numpy as np
import onnxruntime as ort

AVAILABLE_LANGS = ["en", "ko", "es", "pt", "fr"]


class UnicodeProcessor:
    def __init__(self, unicode_indexer_path: str):
        with open(unicode_indexer_path, "r") as f:
            self.indexer = json.load(f)

    def _preprocess_text(self, text: str, lang: str) -> str:
        # TODO: Need advanced normalizer for better performance
        text = normalize("NFKD", text)

        # Remove emojis (wide Unicode range)
        emoji_pattern = re.compile(
            "[\U0001f600-\U0001f64f"  # emoticons
            "\U0001f300-\U0001f5ff"  # symbols & pictographs
            "\U0001f680-\U0001f6ff"  # transport & map symbols
            "\U0001f700-\U0001f77f"
            "\U0001f780-\U0001f7ff"
            "\U0001f800-\U0001f8ff"
            "\U0001f900-\U0001f9ff"
            "\U0001fa00-\U0001fa6f"
            "\U0001fa70-\U0001faff"
            "\u2600-\u26ff"
            "\u2700-\u27bf"
            "\U0001f1e6-\U0001f1ff]+",
            flags=re.UNICODE,
        )
        text = emoji_pattern.sub("", text)

        # Replace various dashes and symbols
        replacements = {
            "–": "-",
            "‑": "-",
            "—": "-",
            "_": " ",
            "\u201c": '"',  # left double quote
            "\u201d": '"',  # right double quote
            "\u2018": "'",  # left single quote
            "\u2019": "'",  # right single quote
            "´": "'",
            "`": "'",
            "[": " ",
            "]": " ",
            "|": " ",
            "/": " ",
            "#": " ",
            "→": " ",
            "←": " ",
        }
        for k, v in replacements.items():
            text = text.replace(k, v)

        # Remove special symbols
        text = re.sub(r"[♥☆♡©\\]", "", text)

        # Replace known expressions
        expr_replacements = {
            "@": " at ",
            "e.g.,": "for example, ",
            "i.e.,": "that is, ",
        }
        for k, v in expr_replacements.items():
            text = text.replace(k, v)

        # Fix spacing around punctuation
        text = re.sub(r" ,", ",", text)
        text = re.sub(r" \.", ".", text)
        text = re.sub(r" !", "!", text)
        text = re.sub(r" \?", "?", text)
        text = re.sub(r" ;", ";", text)
        text = re.sub(r" :", ":", text)
        text = re.sub(r" '", "'", text)

        # Remove duplicate quotes
        while '""' in text:
            text = text.replace('""', '"')
        while "''" in text:
            text = text.replace("''", "'")
        while "``" in text:
            text = text.replace("``", "`")

        # Remove extra spaces
        text = re.sub(r"\s+", " ", text).strip()

        # If text doesn't end with punctuation, quotes, or closing brackets, add a period
        if not re.search(r"[.!?;:,'\"')\]}…。」』】〉》›»]$", text):
            text += "."

        if lang not in AVAILABLE_LANGS:
            raise ValueError(f"Invalid language: {lang}")
        text = f"<{lang}>" + text + f"</{lang}>"
        return text

    def _get_text_mask(self, text_ids_lengths: np.ndarray) -> np.ndarray:
        text_mask = length_to_mask(text_ids_lengths)
        return text_mask

    def _text_to_unicode_values(self, text: str) -> np.ndarray:
        unicode_values = np.array(
            [ord(char) for char in text], dtype=np.uint16
        )  # 2 bytes
        return unicode_values

    def __call__(
        self, text_list: list[str], lang_list: list[str]
    ) -> tuple[np.ndarray, np.ndarray]:
        text_list = [
            self._preprocess_text(t, lang) for t, lang in zip(text_list, lang_list)
        ]
        text_ids_lengths = np.array([len(text) for text in text_list], dtype=np.int64)
        text_ids = np.zeros((len(text_list), text_ids_lengths.max()), dtype=np.int64)
        for i, text in enumerate(text_list):
            unicode_vals = self._text_to_unicode_values(text)
            text_ids[i, : len(unicode_vals)] = np.array(
                [self.indexer[val] for val in unicode_vals], dtype=np.int64
            )
        text_mask = self._get_text_mask(text_ids_lengths)
        return text_ids, text_mask


class Style:
    def __init__(self, style_ttl_onnx: np.ndarray, style_dp_onnx: np.ndarray):
        self.ttl = style_ttl_onnx
        self.dp = style_dp_onnx


class TextToSpeech:
    def __init__(
        self,
        cfgs: dict,
        text_processor: UnicodeProcessor,
        dp_ort: ort.InferenceSession,
        text_enc_ort: ort.InferenceSession,
        vector_est_ort: ort.InferenceSession,
        vocoder_ort: ort.InferenceSession,
    ):
        self.cfgs = cfgs
        self.text_processor = text_processor
        self.dp_ort = dp_ort
        self.text_enc_ort = text_enc_ort
        self.vector_est_ort = vector_est_ort
        self.vocoder_ort = vocoder_ort
        self.sample_rate = cfgs["ae"]["sample_rate"]
        self.base_chunk_size = cfgs["ae"]["base_chunk_size"]
        self.chunk_compress_factor = cfgs["ttl"]["chunk_compress_factor"]
        self.ldim = cfgs["ttl"]["latent_dim"]

    def sample_noisy_latent(
        self, duration: np.ndarray
    ) -> tuple[np.ndarray, np.ndarray]:
        bsz = len(duration)
        wav_len_max = duration.max() * self.sample_rate
        wav_lengths = (duration * self.sample_rate).astype(np.int64)
        chunk_size = self.base_chunk_size * self.chunk_compress_factor
        latent_len = ((wav_len_max + chunk_size - 1) / chunk_size).astype(np.int32)
        latent_dim = self.ldim * self.chunk_compress_factor
        noisy_latent = np.random.randn(bsz, latent_dim, latent_len).astype(np.float32)
        latent_mask = get_latent_mask(
            wav_lengths, self.base_chunk_size, self.chunk_compress_factor
        )
        noisy_latent = noisy_latent * latent_mask
        return noisy_latent, latent_mask

    def _infer(
        self,
        text_list: list[str],
        lang_list: list[str],
        style: Style,
        total_step: int,
        speed: float = 1.05,
    ) -> tuple[np.ndarray, np.ndarray]:
        assert (
            len(text_list) == style.ttl.shape[0]
        ), "Number of texts must match number of style vectors"
        bsz = len(text_list)
        text_ids, text_mask = self.text_processor(text_list, lang_list)
        dur_onnx, *_ = self.dp_ort.run(
            None, {"text_ids": text_ids, "style_dp": style.dp, "text_mask": text_mask}
        )
        dur_onnx = dur_onnx / speed
        text_emb_onnx, *_ = self.text_enc_ort.run(
            None,
            {"text_ids": text_ids, "style_ttl": style.ttl, "text_mask": text_mask},
        )  # dur_onnx: [bsz]
        xt, latent_mask = self.sample_noisy_latent(dur_onnx)
        total_step_np = np.array([total_step] * bsz, dtype=np.float32)
        for step in range(total_step):
            current_step = np.array([step] * bsz, dtype=np.float32)
            xt, *_ = self.vector_est_ort.run(
                None,
                {
                    "noisy_latent": xt,
                    "text_emb": text_emb_onnx,
                    "style_ttl": style.ttl,
                    "text_mask": text_mask,
                    "latent_mask": latent_mask,
                    "current_step": current_step,
                    "total_step": total_step_np,
                },
            )
        wav, *_ = self.vocoder_ort.run(None, {"latent": xt})
        return wav, dur_onnx

    def __call__(
        self,
        text: str,
        lang: str,
        style: Style,
        total_step: int,
        speed: float = 1.05,
        silence_duration: float = 0.3,
    ) -> tuple[np.ndarray, np.ndarray]:
        assert (
            style.ttl.shape[0] == 1
        ), "Single speaker text to speech only supports single style"
        max_len = 120 if lang == "ko" else 300
        text_list = chunk_text(text, max_len=max_len)
        wav_cat = None
        dur_cat = None
        for text in text_list:
            wav, dur_onnx = self._infer([text], [lang], style, total_step, speed)
            if wav_cat is None:
                wav_cat = wav
                dur_cat = dur_onnx
            else:
                silence = np.zeros(
                    (1, int(silence_duration * self.sample_rate)), dtype=np.float32
                )
                wav_cat = np.concatenate([wav_cat, silence, wav], axis=1)
                dur_cat += dur_onnx + silence_duration
        return wav_cat, dur_cat

    def batch(
        self,
        text_list: list[str],
        lang_list: list[str],
        style: Style,
        total_step: int,
        speed: float = 1.05,
    ) -> tuple[np.ndarray, np.ndarray]:
        return self._infer(text_list, lang_list, style, total_step, speed)


def length_to_mask(lengths: np.ndarray, max_len: Optional[int] = None) -> np.ndarray:
    """
    Convert lengths to binary mask.

    Args:
        lengths: (B,)
        max_len: int

    Returns:
        mask: (B, 1, max_len)
    """
    max_len = max_len or lengths.max()
    ids = np.arange(0, max_len)
    mask = (ids < np.expand_dims(lengths, axis=1)).astype(np.float32)
    return mask.reshape(-1, 1, max_len)


def get_latent_mask(
    wav_lengths: np.ndarray, base_chunk_size: int, chunk_compress_factor: int
) -> np.ndarray:
    latent_size = base_chunk_size * chunk_compress_factor
    latent_lengths = (wav_lengths + latent_size - 1) // latent_size
    latent_mask = length_to_mask(latent_lengths)
    return latent_mask


def load_onnx(
    onnx_path: str, opts: ort.SessionOptions, providers: list[str]
) -> ort.InferenceSession:
    return ort.InferenceSession(onnx_path, sess_options=opts, providers=providers)


def load_onnx_all(
    onnx_dir: str, opts: ort.SessionOptions, providers: list[str]
) -> tuple[
    ort.InferenceSession,
    ort.InferenceSession,
    ort.InferenceSession,
    ort.InferenceSession,
]:
    dp_onnx_path = os.path.join(onnx_dir, "duration_predictor.onnx")
    text_enc_onnx_path = os.path.join(onnx_dir, "text_encoder.onnx")
    vector_est_onnx_path = os.path.join(onnx_dir, "vector_estimator.onnx")
    vocoder_onnx_path = os.path.join(onnx_dir, "vocoder.onnx")

    dp_ort = load_onnx(dp_onnx_path, opts, providers)
    text_enc_ort = load_onnx(text_enc_onnx_path, opts, providers)
    vector_est_ort = load_onnx(vector_est_onnx_path, opts, providers)
    vocoder_ort = load_onnx(vocoder_onnx_path, opts, providers)
    return dp_ort, text_enc_ort, vector_est_ort, vocoder_ort


def load_cfgs(onnx_dir: str) -> dict:
    cfg_path = os.path.join(onnx_dir, "tts.json")
    with open(cfg_path, "r") as f:
        cfgs = json.load(f)
    return cfgs


def load_text_processor(onnx_dir: str) -> UnicodeProcessor:
    unicode_indexer_path = os.path.join(onnx_dir, "unicode_indexer.json")
    text_processor = UnicodeProcessor(unicode_indexer_path)
    return text_processor


def load_text_to_speech(onnx_dir: str, use_gpu: bool = False) -> TextToSpeech:
    opts = ort.SessionOptions()
    if use_gpu:
        raise NotImplementedError("GPU mode is not fully tested")
    else:
        providers = ["CPUExecutionProvider"]
        print("Using CPU for inference")
    cfgs = load_cfgs(onnx_dir)
    dp_ort, text_enc_ort, vector_est_ort, vocoder_ort = load_onnx_all(
        onnx_dir, opts, providers
    )
    text_processor = load_text_processor(onnx_dir)
    return TextToSpeech(
        cfgs, text_processor, dp_ort, text_enc_ort, vector_est_ort, vocoder_ort
    )


def load_voice_style(voice_style_paths: list[str], verbose: bool = False) -> Style:
    bsz = len(voice_style_paths)

    # Read first file to get dimensions
    with open(voice_style_paths[0], "r") as f:
        first_style = json.load(f)
    ttl_dims = first_style["style_ttl"]["dims"]
    dp_dims = first_style["style_dp"]["dims"]

    # Pre-allocate arrays with full batch size
    ttl_style = np.zeros([bsz, ttl_dims[1], ttl_dims[2]], dtype=np.float32)
    dp_style = np.zeros([bsz, dp_dims[1], dp_dims[2]], dtype=np.float32)

    # Fill in the data
    for i, voice_style_path in enumerate(voice_style_paths):
        with open(voice_style_path, "r") as f:
            voice_style = json.load(f)

        ttl_data = np.array(
            voice_style["style_ttl"]["data"], dtype=np.float32
        ).flatten()
        ttl_style[i] = ttl_data.reshape(ttl_dims[1], ttl_dims[2])

        dp_data = np.array(voice_style["style_dp"]["data"], dtype=np.float32).flatten()
        dp_style[i] = dp_data.reshape(dp_dims[1], dp_dims[2])

    if verbose:
        print(f"Loaded {bsz} voice styles")
    return Style(ttl_style, dp_style)


@contextmanager
def timer(name: str):
    start = time.time()
    print(f"{name}...")
    yield
    print(f"  -> {name} completed in {time.time() - start:.2f} sec")


def sanitize_filename(text: str, max_len: int) -> str:
    """Sanitize filename by replacing non-alphanumeric characters with underscores (supports Unicode)"""
    prefix = text[:max_len]
    # \w matches Unicode word characters (letters, digits, underscore) with re.UNICODE
    # We replace non-word characters except keeping existing underscores
    return re.sub(r"[^\w]", "_", prefix, flags=re.UNICODE)


def chunk_text(text: str, max_len: int = 300) -> list[str]:
    """
    Split text into chunks by paragraphs and sentences.

    Args:
        text: Input text to chunk
        max_len: Maximum length of each chunk (default: 300)

    Returns:
        List of text chunks
    """
    # Split by paragraph (two or more newlines)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n+", text.strip()) if p.strip()]

    chunks = []

    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph:
            continue

        # Split by sentence boundaries (period, question mark, exclamation mark followed by space),
        # excluding common abbreviations like Mr., Mrs., Dr., etc. and single capital letters like F.
        pattern = r"(?<!Mr\.)(?<!Mrs\.)(?<!Ms\.)(?<!Dr\.)(?<!Prof\.)(?<!Sr\.)(?<!Jr\.)(?<!Ph\.D\.)(?<!etc\.)(?<!e\.g\.)(?<!i\.e\.)(?<!vs\.)(?<!Inc\.)(?<!Ltd\.)(?<!Co\.)(?<!Corp\.)(?<!St\.)(?<!Ave\.)(?<!Blvd\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+"
        sentences = re.split(pattern, paragraph)

        current_chunk = ""

        for sentence in sentences:
            if len(current_chunk) + len(sentence) + 1 <= max_len:
                current_chunk += (" " if current_chunk else "") + sentence
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = sentence

        if current_chunk:
            chunks.append(current_chunk.strip())

    return chunks
20
py/pyproject.toml
Normal file
@@ -0,0 +1,20 @@
[project]
name = "tts-onnx"
version = "1.0.0"
description = "TTS ONNX Inference"
requires-python = ">=3.10"
dependencies = [
    "onnxruntime==1.23.1",
    "numpy>=1.26.0",
    "soundfile>=0.12.1",
    "librosa>=0.10.0",
    "PyYAML>=6.0",
]

[tool.setuptools]
py-modules = []

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
5
py/requirements.txt
Normal file
@@ -0,0 +1,5 @@
onnxruntime==1.23.1
numpy>=1.26.0
soundfile>=0.12.1
librosa>=0.10.0
PyYAML>=6.0
1142
py/uv.lock
generated
Normal file
21
rust/.gitignore
vendored
Normal file
@@ -0,0 +1,21 @@
# Rust build artifacts
/target/
Cargo.lock

# Output directory
/results/

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Debug
*.pdb
44
rust/Cargo.toml
Normal file
@@ -0,0 +1,44 @@
[package]
name = "supertonic-tts"
version = "0.1.0"
edition = "2021"

[dependencies]
# ONNX Runtime
ort = "2.0.0-rc.7"

# Array processing (like NumPy)
ndarray = { version = "0.16", features = ["rayon"] }
rand = "0.8"
rand_distr = "0.4"

# Parallel processing
rayon = "1.10"

# Audio processing
hound = "3.5"
rustfft = "6.2"

# JSON serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

# CLI argument parsing
clap = { version = "4.5", features = ["derive"] }

# Error handling
anyhow = "1.0"

# Unicode normalization
unicode-normalization = "0.1"

# Regular expressions
regex = "1.10"

# System calls
libc = "0.2"

[[bin]]
name = "example_onnx"
path = "src/example_onnx.rs"
146
rust/README.md
Normal file
@@ -0,0 +1,146 @@
# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using Rust.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)

**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.

**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).

**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.

## Installation

This project uses [Cargo](https://doc.rust-lang.org/cargo/) for package management.

### Install Rust (if not already installed)
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

### Build the project
```bash
cargo build --release
```

## Basic Usage

You can run the inference in two ways:
1. **Using cargo run** (builds if needed, then runs)
2. **Direct binary execution** (faster if already built)

### Example 1: Default Inference
Run inference with default settings:
```bash
# Using cargo run
cargo run --release --bin example_onnx

# Or directly execute the built binary (faster)
./target/release/example_onnx
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4

### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
# Using cargo run
cargo run --release --bin example_onnx -- \
    --batch \
    --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
    --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
    --lang en,ko

# Or using the binary directly
./target/release/example_onnx \
    --batch \
    --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
    --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
    --lang en,ko
```

This will:
- Generate speech for 2 different voice-text-language pairs
- Use the male voice (M1.json) for the first text in English
- Use the female voice (F1.json) for the second text in Korean
- Process both samples in a single batch
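In batch mode, the comma-separated style and language lists and the pipe-separated text list are matched up by position. A minimal Python sketch of that pairing (illustrative only; the actual parsing is done by clap's `value_delimiter` in the Rust binary):

```python
# Illustrative sketch: batch arguments pair up by index.
voice_styles = "assets/voice_styles/M1.json,assets/voice_styles/F1.json".split(",")
texts = "First sentence in English.|두 번째 한국어 문장.".split("|")
langs = "en,ko".split(",")

# Batch mode requires all three lists to have the same length.
assert len(voice_styles) == len(texts) == len(langs)

pairs = list(zip(voice_styles, texts, langs))
# pairs[0] -> ("assets/voice_styles/M1.json", "First sentence in English.", "en")
```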
### Example 3: High Quality Inference
|
||||
Increase denoising steps for better quality:
|
||||
```bash
|
||||
# Using cargo run
|
||||
cargo run --release --bin example_onnx -- \
|
||||
--total-step 10 \
|
||||
--voice-style assets/voice_styles/M1.json \
|
||||
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||
|
||||
# Or using the binary directly
|
||||
./target/release/example_onnx \
|
||||
--total-step 10 \
|
||||
--voice-style assets/voice_styles/M1.json \
|
||||
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||
```
|
||||
|
||||
This will:
|
||||
- Use 10 denoising steps instead of the default 5
|
||||
- Produce higher quality output at the cost of slower inference
|
||||
|
||||
### Example 4: Long-Form Inference
|
||||
The system automatically chunks long texts into manageable segments, synthesizes each segment separately, and concatenates them with natural pauses (0.3 seconds by default) into a single audio file. This happens by default when you don't use the `--batch` flag:
|
||||
|
||||
```bash
|
||||
# Using cargo run
|
||||
cargo run --release --bin example_onnx -- \
|
||||
--voice-style assets/voice_styles/M1.json \
|
||||
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
|
||||
|
||||
# Or using the binary directly
|
||||
./target/release/example_onnx \
|
||||
--voice-style assets/voice_styles/M1.json \
|
||||
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
|
||||
```
|
||||
|
||||
This will:
|
||||
- Automatically split the text into chunks based on paragraph and sentence boundaries
|
||||
- Synthesize each chunk separately
|
||||
- Add 0.3 seconds of silence between chunks for natural pauses
|
||||
- Concatenate all chunks into a single audio file
|
||||
|
||||
**Note**: Automatic text chunking is disabled when using `--batch` mode. In batch mode, each text is processed as-is without chunking.
|
||||
|
||||
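The chunking behavior can be sketched in a few lines of Python: sentences are accumulated greedily up to the length limit, and a fixed stretch of silence is inserted between synthesized chunks. This is a simplified sketch with an assumed sample rate; the real implementation also splits on paragraphs and handles abbreviations:

```python
# Simplified sketch of greedy chunking plus silence insertion
# (the actual implementation also splits paragraphs and skips abbreviations).
def greedy_chunks(sentences, max_len=300):
    chunks, current = [], ""
    for s in sentences:
        if len(current) + len(s) + 1 <= max_len:
            current += (" " if current else "") + s
        else:
            if current:
                chunks.append(current)
            current = s
    if current:
        chunks.append(current)
    return chunks

sentences = ["A" * 120 + ".", "B" * 120 + ".", "C" * 120 + "."]
chunks = greedy_chunks(sentences)
# The first two sentences fit in one 300-char chunk; the third starts a new one.

sample_rate = 44100  # assumed sample rate, for illustration only
pause = [0.0] * int(0.3 * sample_rate)  # 0.3 s of silence between chunks
```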
## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech speed factor (higher = faster) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
| `--lang` | str+ | `en` | Language(s) for synthesis, comma-separated (en, ko, es, pt, fr) |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (multiple text-style pairs, disables automatic chunking) |

## Notes

- **Multilingual Support**: Use `--lang` to specify the language for each text. Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
- **Known Issues**: On some platforms (especially macOS), there may be a mutex cleanup warning during exit. This is a known ONNX Runtime issue and doesn't affect functionality. The implementation uses `libc::_exit()` and `mem::forget()` to bypass it.
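Internally, the helper validates the language code and wraps the text in language tags (e.g. `<en>...</en>`) before tokenization. A minimal Python sketch of that step, mirroring `preprocess_text` in `rust/src/helper.rs`:

```python
# Mirrors the language validation and tag wrapping in rust/src/helper.rs.
AVAILABLE_LANGS = ("en", "ko", "es", "pt", "fr")

def wrap_with_lang(text: str, lang: str) -> str:
    if lang not in AVAILABLE_LANGS:
        raise ValueError(f"Invalid language: {lang}. Available: {AVAILABLE_LANGS}")
    return f"<{lang}>{text}</{lang}>"

wrapped = wrap_with_lang("Hello there.", "en")
# -> "<en>Hello there.</en>"
```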
144
rust/src/example_onnx.rs
Normal file
@@ -0,0 +1,144 @@
use anyhow::Result;
use clap::Parser;
use std::path::PathBuf;
use std::fs;
use std::mem;

mod helper;

use helper::{
    load_text_to_speech, load_voice_style, timer, write_wav_file, sanitize_filename,
};

#[derive(Parser, Debug)]
#[command(name = "TTS ONNX Inference")]
#[command(about = "TTS Inference with ONNX Runtime (Rust)", long_about = None)]
struct Args {
    /// Use GPU for inference (default: CPU)
    #[arg(long, default_value = "false")]
    use_gpu: bool,

    /// Path to ONNX model directory
    #[arg(long, default_value = "assets/onnx")]
    onnx_dir: String,

    /// Number of denoising steps
    #[arg(long, default_value = "5")]
    total_step: usize,

    /// Speech speed factor (higher = faster)
    #[arg(long, default_value = "1.05")]
    speed: f32,

    /// Number of times to generate
    #[arg(long, default_value = "4")]
    n_test: usize,

    /// Voice style file path(s)
    #[arg(long, value_delimiter = ',', default_values_t = vec!["assets/voice_styles/M1.json".to_string()])]
    voice_style: Vec<String>,

    /// Text(s) to synthesize
    #[arg(long, value_delimiter = '|', default_values_t = vec!["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.".to_string()])]
    text: Vec<String>,

    /// Language(s) for synthesis (en, ko, es, pt, fr)
    #[arg(long, value_delimiter = ',', default_values_t = vec!["en".to_string()])]
    lang: Vec<String>,

    /// Output directory
    #[arg(long, default_value = "results")]
    save_dir: String,

    /// Enable batch mode (multiple text-style pairs)
    #[arg(long, default_value = "false")]
    batch: bool,
}

fn main() -> Result<()> {
    println!("=== TTS Inference with ONNX Runtime (Rust) ===\n");

    // --- 1. Parse arguments --- //
    let args = Args::parse();
    let total_step = args.total_step;
    let speed = args.speed;
    let n_test = args.n_test;
    let voice_style_paths = &args.voice_style;
    let text_list = &args.text;
    let lang_list = &args.lang;
    let save_dir = &args.save_dir;
    let batch = args.batch;

    if batch {
        if voice_style_paths.len() != text_list.len() {
            anyhow::bail!(
                "Number of voice styles ({}) must match number of texts ({})",
                voice_style_paths.len(),
                text_list.len()
            );
        }
        if lang_list.len() != text_list.len() {
            anyhow::bail!(
                "Number of languages ({}) must match number of texts ({})",
                lang_list.len(),
                text_list.len()
            );
        }
    }

    let bsz = voice_style_paths.len();

    // --- 2. Load TTS components --- //
    let mut text_to_speech = load_text_to_speech(&args.onnx_dir, args.use_gpu)?;

    // --- 3. Load voice styles --- //
    let style = load_voice_style(voice_style_paths, true)?;

    // --- 4. Synthesize speech --- //
    fs::create_dir_all(save_dir)?;

    for n in 0..n_test {
        println!("\n[{}/{}] Starting synthesis...", n + 1, n_test);

        let (wav, duration) = if batch {
            timer("Generating speech from text", || {
                text_to_speech.batch(text_list, lang_list, &style, total_step, speed)
            })?
        } else {
            let (w, d) = timer("Generating speech from text", || {
                text_to_speech.call(&text_list[0], &lang_list[0], &style, total_step, speed, 0.3)
            })?;
            (w, vec![d])
        };

        // Save outputs
        for i in 0..bsz {
            let fname = format!("{}_{}.wav", sanitize_filename(&text_list[i], 20), n + 1);
            let wav_slice = if batch {
                let wav_len = wav.len() / bsz;
                let actual_len = (text_to_speech.sample_rate as f32 * duration[i]) as usize;
                let wav_start = i * wav_len;
                let wav_end = wav_start + actual_len.min(wav_len);
                &wav[wav_start..wav_end]
            } else {
                // For non-batch mode, wav is a single concatenated audio
                let actual_len = (text_to_speech.sample_rate as f32 * duration[0]) as usize;
                &wav[..actual_len.min(wav.len())]
            };

            let output_path = PathBuf::from(save_dir).join(&fname);
            write_wav_file(&output_path, wav_slice, text_to_speech.sample_rate)?;
            println!("Saved: {}", output_path.display());
        }
    }

    println!("\n=== Synthesis completed successfully! ===");

    // Prevent ONNX Runtime sessions from being dropped, which causes mutex cleanup issues
    mem::forget(text_to_speech);

    // Use _exit to bypass all cleanup handlers and avoid ONNX Runtime mutex issues on macOS
    unsafe {
        libc::_exit(0);
    }
}
838
rust/src/helper.rs
Normal file
@@ -0,0 +1,838 @@
|
||||
// ============================================================================
|
||||
// TTS Helper Module - All utility functions and structures
|
||||
// ============================================================================
|
||||
|
||||
use ndarray::{Array, Array3};
|
||||
use serde::{Deserialize, Serialize};
|
||||
use serde_json;
|
||||
use std::fs::File;
|
||||
use std::io::BufReader;
|
||||
use std::path::Path;
|
||||
use anyhow::{Result, Context, bail};
|
||||
use unicode_normalization::UnicodeNormalization;
|
||||
use hound::{WavWriter, WavSpec, SampleFormat};
|
||||
use rand_distr::{Distribution, Normal};
|
||||
use regex::Regex;
|
||||
|
||||
// Available languages for multilingual TTS
|
||||
pub const AVAILABLE_LANGS: &[&str] = &["en", "ko", "es", "pt", "fr"];
|
||||
|
||||
pub fn is_valid_lang(lang: &str) -> bool {
|
||||
AVAILABLE_LANGS.contains(&lang)
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Configuration Structures
|
||||
// ============================================================================
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct Config {
|
||||
pub ae: AEConfig,
|
||||
pub ttl: TTLConfig,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct AEConfig {
|
||||
pub sample_rate: i32,
|
||||
pub base_chunk_size: i32,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct TTLConfig {
|
||||
pub chunk_compress_factor: i32,
|
||||
pub latent_dim: i32,
|
||||
}
|
||||
|
||||
/// Load configuration from JSON file
|
||||
pub fn load_cfgs<P: AsRef<Path>>(onnx_dir: P) -> Result<Config> {
|
||||
let cfg_path = onnx_dir.as_ref().join("tts.json");
|
||||
let file = File::open(cfg_path)?;
|
||||
let reader = BufReader::new(file);
|
||||
let cfgs: Config = serde_json::from_reader(reader)?;
|
||||
Ok(cfgs)
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Voice Style Data Structure
|
||||
// ============================================================================
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct VoiceStyleData {
|
||||
pub style_ttl: StyleComponent,
|
||||
pub style_dp: StyleComponent,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct StyleComponent {
|
||||
pub data: Vec<Vec<Vec<f32>>>,
|
||||
pub dims: Vec<usize>,
|
||||
#[serde(rename = "type")]
|
||||
pub dtype: String,
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Unicode Text Processor
|
||||
// ============================================================================
|
||||
|
||||
pub struct UnicodeProcessor {
|
||||
indexer: Vec<i64>,
|
||||
}
|
||||
|
||||
impl UnicodeProcessor {
|
||||
pub fn new<P: AsRef<Path>>(unicode_indexer_json_path: P) -> Result<Self> {
|
||||
let file = File::open(unicode_indexer_json_path)?;
|
||||
let reader = BufReader::new(file);
|
||||
let indexer: Vec<i64> = serde_json::from_reader(reader)?;
|
||||
Ok(UnicodeProcessor { indexer })
|
||||
}
|
||||
|
||||
pub fn call(&self, text_list: &[String], lang_list: &[String]) -> Result<(Vec<Vec<i64>>, Array3<f32>)> {
|
||||
let mut processed_texts: Vec<String> = Vec::new();
|
||||
for (text, lang) in text_list.iter().zip(lang_list.iter()) {
|
||||
processed_texts.push(preprocess_text(text, lang)?);
|
||||
}
|
||||
|
||||
let text_ids_lengths: Vec<usize> = processed_texts
|
||||
.iter()
|
||||
.map(|t| t.chars().count())
|
||||
.collect();
|
||||
|
||||
let max_len = *text_ids_lengths.iter().max().unwrap_or(&0);
|
||||
|
||||
let mut text_ids = Vec::new();
|
||||
for text in &processed_texts {
|
||||
let mut row = vec![0i64; max_len];
|
||||
let unicode_vals = text_to_unicode_values(text);
|
||||
for (j, &val) in unicode_vals.iter().enumerate() {
|
||||
if val < self.indexer.len() {
|
||||
row[j] = self.indexer[val];
|
||||
} else {
|
||||
row[j] = -1;
|
||||
}
|
||||
}
|
||||
text_ids.push(row);
|
||||
}
|
||||
|
||||
let text_mask = get_text_mask(&text_ids_lengths);
|
||||
|
||||
Ok((text_ids, text_mask))
|
||||
}
|
||||
}
|
||||
|
||||
pub fn preprocess_text(text: &str, lang: &str) -> Result<String> {
|
||||
// TODO: Need advanced normalizer for better performance
|
||||
let mut text: String = text.nfkd().collect();
|
||||
|
||||
// Remove emojis (wide Unicode range)
|
||||
let emoji_pattern = Regex::new(r"[\x{1F600}-\x{1F64F}\x{1F300}-\x{1F5FF}\x{1F680}-\x{1F6FF}\x{1F700}-\x{1F77F}\x{1F780}-\x{1F7FF}\x{1F800}-\x{1F8FF}\x{1F900}-\x{1F9FF}\x{1FA00}-\x{1FA6F}\x{1FA70}-\x{1FAFF}\x{2600}-\x{26FF}\x{2700}-\x{27BF}\x{1F1E6}-\x{1F1FF}]+").unwrap();
|
||||
text = emoji_pattern.replace_all(&text, "").to_string();
|
||||
|
||||
// Replace various dashes and symbols
|
||||
let replacements = [
|
||||
("–", "-"), // en dash
|
||||
("‑", "-"), // non-breaking hyphen
|
||||
("—", "-"), // em dash
|
||||
("_", " "), // underscore
|
||||
("\u{201C}", "\""), // left double quote
|
||||
("\u{201D}", "\""), // right double quote
|
||||
("\u{2018}", "'"), // left single quote
|
||||
("\u{2019}", "'"), // right single quote
|
||||
("´", "'"), // acute accent
|
||||
("`", "'"), // grave accent
|
||||
("[", " "), // left bracket
|
||||
("]", " "), // right bracket
|
||||
("|", " "), // vertical bar
|
||||
("/", " "), // slash
|
||||
("#", " "), // hash
|
||||
("→", " "), // right arrow
|
||||
("←", " "), // left arrow
|
||||
];
|
||||
|
||||
for (from, to) in &replacements {
|
||||
text = text.replace(from, to);
|
||||
}
|
||||
|
||||
// Remove special symbols
|
||||
let special_symbols = ["♥", "☆", "♡", "©", "\\"];
|
||||
for symbol in &special_symbols {
|
||||
text = text.replace(symbol, "");
|
||||
}
|
||||
|
||||
// Replace known expressions
|
||||
let expr_replacements = [
|
||||
("@", " at "),
|
||||
("e.g.,", "for example, "),
|
||||
("i.e.,", "that is, "),
|
||||
];
|
||||
|
||||
for (from, to) in &expr_replacements {
|
||||
text = text.replace(from, to);
|
||||
}
|
||||
|
||||
// Fix spacing around punctuation
|
||||
text = Regex::new(r" ,").unwrap().replace_all(&text, ",").to_string();
|
||||
text = Regex::new(r" \.").unwrap().replace_all(&text, ".").to_string();
|
||||
text = Regex::new(r" !").unwrap().replace_all(&text, "!").to_string();
|
||||
text = Regex::new(r" \?").unwrap().replace_all(&text, "?").to_string();
|
||||
text = Regex::new(r" ;").unwrap().replace_all(&text, ";").to_string();
|
||||
text = Regex::new(r" :").unwrap().replace_all(&text, ":").to_string();
|
||||
text = Regex::new(r" '").unwrap().replace_all(&text, "'").to_string();
|
||||
|
||||
// Remove duplicate quotes
|
||||
while text.contains("\"\"") {
|
||||
text = text.replace("\"\"", "\"");
|
||||
}
|
||||
while text.contains("''") {
|
||||
text = text.replace("''", "'");
|
||||
}
|
||||
while text.contains("``") {
|
||||
text = text.replace("``", "`");
|
||||
}
|
||||
|
||||
// Remove extra spaces
|
||||
text = Regex::new(r"\s+").unwrap().replace_all(&text, " ").to_string();
|
||||
text = text.trim().to_string();
|
||||
|
||||
// If text doesn't end with punctuation, quotes, or closing brackets, add a period
|
||||
if !text.is_empty() {
|
||||
let ends_with_punct = Regex::new(r#"[.!?;:,'"\u{201C}\u{201D}\u{2018}\u{2019})\]}…。」』】〉》›»]$"#).unwrap();
|
||||
if !ends_with_punct.is_match(&text) {
|
||||
text.push('.');
|
||||
}
|
||||
}
|
||||
|
||||
// Validate language
|
||||
if !is_valid_lang(lang) {
|
||||
bail!("Invalid language: {}. Available: {:?}", lang, AVAILABLE_LANGS);
|
||||
}
|
||||
|
||||
// Wrap text with language tags
|
||||
text = format!("<{}>{}</{}>", lang, text, lang);
|
||||
|
||||
Ok(text)
|
||||
}
|
||||
|
||||
pub fn text_to_unicode_values(text: &str) -> Vec<usize> {
|
||||
text.chars().map(|c| c as usize).collect()
|
||||
}
|
||||
|
||||
pub fn length_to_mask(lengths: &[usize], max_len: Option<usize>) -> Array3<f32> {
|
||||
let bsz = lengths.len();
|
||||
let max_len = max_len.unwrap_or_else(|| *lengths.iter().max().unwrap_or(&0));
|
||||
|
||||
let mut mask = Array3::<f32>::zeros((bsz, 1, max_len));
|
||||
for (i, &len) in lengths.iter().enumerate() {
|
||||
for j in 0..len.min(max_len) {
|
||||
mask[[i, 0, j]] = 1.0;
|
||||
}
|
||||
}
|
||||
mask
|
||||
}
|
||||
|
||||
pub fn get_text_mask(text_ids_lengths: &[usize]) -> Array3<f32> {
|
||||
let max_len = *text_ids_lengths.iter().max().unwrap_or(&0);
|
||||
length_to_mask(text_ids_lengths, Some(max_len))
|
||||
}
|
||||
|
||||
/// Sample noisy latent from normal distribution and apply mask
|
||||
pub fn sample_noisy_latent(
|
||||
duration: &[f32],
|
||||
sample_rate: i32,
|
||||
base_chunk_size: i32,
|
||||
chunk_compress: i32,
|
||||
latent_dim: i32,
|
||||
) -> (Array3<f32>, Array3<f32>) {
|
||||
let bsz = duration.len();
|
||||
let max_dur = duration.iter().fold(0.0f32, |a, &b| a.max(b));
|
||||
|
||||
let wav_len_max = (max_dur * sample_rate as f32) as usize;
|
||||
let wav_lengths: Vec<usize> = duration
|
||||
.iter()
|
||||
.map(|&d| (d * sample_rate as f32) as usize)
|
||||
.collect();
|
||||
|
||||
let chunk_size = (base_chunk_size * chunk_compress) as usize;
|
||||
let latent_len = (wav_len_max + chunk_size - 1) / chunk_size;
|
||||
let latent_dim_val = (latent_dim * chunk_compress) as usize;
|
||||
|
||||
let mut noisy_latent = Array3::<f32>::zeros((bsz, latent_dim_val, latent_len));
|
||||
|
||||
let normal = Normal::new(0.0, 1.0).unwrap();
|
||||
let mut rng = rand::thread_rng();
|
||||
|
||||
for b in 0..bsz {
|
||||
for d in 0..latent_dim_val {
|
||||
for t in 0..latent_len {
|
||||
noisy_latent[[b, d, t]] = normal.sample(&mut rng);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let latent_lengths: Vec<usize> = wav_lengths
|
||||
.iter()
|
||||
.map(|&len| (len + chunk_size - 1) / chunk_size)
|
||||
.collect();
|
||||
|
||||
let latent_mask = length_to_mask(&latent_lengths, Some(latent_len));
|
||||
|
||||
// Apply mask
|
||||
for b in 0..bsz {
|
||||
for d in 0..latent_dim_val {
|
||||
for t in 0..latent_len {
|
||||
noisy_latent[[b, d, t]] *= latent_mask[[b, 0, t]];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
(noisy_latent, latent_mask)
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// WAV File I/O
|
||||
// ============================================================================
|
||||
|
||||
pub fn write_wav_file<P: AsRef<Path>>(
|
||||
filename: P,
|
||||
audio_data: &[f32],
|
||||
sample_rate: i32,
|
||||
) -> Result<()> {
|
||||
let spec = WavSpec {
|
||||
channels: 1,
|
||||
sample_rate: sample_rate as u32,
|
||||
bits_per_sample: 16,
|
||||
sample_format: SampleFormat::Int,
|
||||
};
|
||||
|
||||
let mut writer = WavWriter::create(filename, spec)?;
|
||||
|
||||
for &sample in audio_data {
|
||||
let clamped = sample.max(-1.0).min(1.0);
|
||||
let val = (clamped * 32767.0) as i16;
|
||||
writer.write_sample(val)?;
|
||||
}
|
||||
|
||||
writer.finalize()?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Text Chunking
|
||||
// ============================================================================
|
||||
|
||||
const MAX_CHUNK_LENGTH: usize = 300;
|
||||
|
||||
const ABBREVIATIONS: &[&str] = &[
|
||||
"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
|
||||
"St.", "Ave.", "Rd.", "Blvd.", "Dept.", "Inc.", "Ltd.",
|
||||
"Co.", "Corp.", "etc.", "vs.", "i.e.", "e.g.", "Ph.D.",
|
||||
];
|
||||
|
||||
pub fn chunk_text(text: &str, max_len: Option<usize>) -> Vec<String> {
|
||||
let max_len = max_len.unwrap_or(MAX_CHUNK_LENGTH);
|
||||
let text = text.trim();
|
||||
|
||||
if text.is_empty() {
|
||||
return vec![String::new()];
|
||||
}
|
||||
|
||||
// Split by paragraphs
|
||||
let para_re = Regex::new(r"\n\s*\n").unwrap();
|
||||
let paragraphs: Vec<&str> = para_re.split(text).collect();
|
||||
let mut chunks = Vec::new();
|
||||
|
||||
for para in paragraphs {
|
||||
let para = para.trim();
|
||||
if para.is_empty() {
|
||||
continue;
|
||||
}
|
||||
|
||||
if para.len() <= max_len {
|
||||
chunks.push(para.to_string());
|
||||
continue;
|
||||
}
|
||||
|
||||
// Split by sentences
|
||||
let sentences = split_sentences(para);
|
||||
let mut current = String::new();
|
||||
let mut current_len = 0;
|
||||
|
||||
for sentence in sentences {
|
||||
let sentence = sentence.trim();
|
||||
if sentence.is_empty() {
|
||||
continue;
|
||||
}
|
||||
|
||||
let sentence_len = sentence.len();
|
||||
if sentence_len > max_len {
|
||||
// If sentence is longer than max_len, split by comma or space
|
||||
if !current.is_empty() {
|
||||
chunks.push(current.trim().to_string());
|
||||
current.clear();
|
||||
current_len = 0;
|
||||
}
|
||||
|
||||
// Try splitting by comma
|
||||
let parts: Vec<&str> = sentence.split(',').collect();
|
||||
for part in parts {
|
||||
let part = part.trim();
|
||||
if part.is_empty() {
|
||||
continue;
|
||||
}
|
||||
|
||||
let part_len = part.len();
|
||||
if part_len > max_len {
|
||||
// Split by space as last resort
|
||||
let words: Vec<&str> = part.split_whitespace().collect();
|
||||
let mut word_chunk = String::new();
|
||||
let mut word_chunk_len = 0;
|
||||
|
||||
for word in words {
|
||||
let word_len = word.len();
|
||||
if word_chunk_len + word_len + 1 > max_len && !word_chunk.is_empty() {
|
||||
chunks.push(word_chunk.trim().to_string());
|
||||
word_chunk.clear();
|
||||
word_chunk_len = 0;
|
||||
}
|
||||
|
||||
if !word_chunk.is_empty() {
|
||||
word_chunk.push(' ');
|
||||
word_chunk_len += 1;
|
||||
}
|
||||
word_chunk.push_str(word);
|
||||
word_chunk_len += word_len;
|
||||
}
|
||||
|
||||
if !word_chunk.is_empty() {
|
||||
chunks.push(word_chunk.trim().to_string());
|
||||
}
|
||||
} else {
|
||||
if current_len + part_len + 1 > max_len && !current.is_empty() {
|
||||
chunks.push(current.trim().to_string());
|
||||
current.clear();
|
||||
current_len = 0;
|
||||
}
|
||||
|
||||
if !current.is_empty() {
|
||||
current.push_str(", ");
|
||||
current_len += 2;
|
||||
}
|
||||
current.push_str(part);
|
||||
current_len += part_len;
|
||||
}
|
||||
}
|
||||
continue;
|
||||
}
|
||||
|
||||
if current_len + sentence_len + 1 > max_len && !current.is_empty() {
|
||||
chunks.push(current.trim().to_string());
|
||||
current.clear();
|
||||
current_len = 0;
|
||||
}
|
||||
|
||||
if !current.is_empty() {
|
||||
current.push(' ');
|
||||
current_len += 1;
|
||||
}
|
||||
current.push_str(sentence);
|
||||
current_len += sentence_len;
|
||||
}
|
||||
|
||||
if !current.is_empty() {
|
||||
chunks.push(current.trim().to_string());
|
||||
}
|
||||
}
|
||||
|
||||
if chunks.is_empty() {
|
||||
vec![String::new()]
|
||||
} else {
|
||||
chunks
|
||||
}
|
||||
}
|
||||
|
||||
fn split_sentences(text: &str) -> Vec<String> {
|
||||
// Rust's regex doesn't support lookbehind, so we use a simpler approach
|
||||
// Split on sentence boundaries and then check if they're abbreviations
|
||||
let re = Regex::new(r"([.!?])\s+").unwrap();
|
||||
|
||||
// Find all matches
|
||||
let matches: Vec<_> = re.find_iter(text).collect();
|
||||
if matches.is_empty() {
|
||||
return vec![text.to_string()];
|
||||
}
|
||||
|
||||
let mut sentences = Vec::new();
|
||||
let mut last_end = 0;
|
||||
|
||||
for m in matches {
|
||||
// Get the text before the punctuation
|
||||
let before_punc = &text[last_end..m.start()];
|
||||
|
||||
// Check if this ends with an abbreviation
|
||||
let mut is_abbrev = false;
|
||||
for abbrev in ABBREVIATIONS {
|
||||
let combined = format!("{}{}", before_punc.trim(), &text[m.start()..m.start()+1]);
|
||||
if combined.ends_with(abbrev) {
|
||||
is_abbrev = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
if !is_abbrev {
|
||||
// This is a real sentence boundary
|
||||
sentences.push(text[last_end..m.end()].to_string());
|
||||
last_end = m.end();
|
||||
}
|
||||
}
|
||||
|
||||
// Add the remaining text
|
||||
if last_end < text.len() {
|
||||
sentences.push(text[last_end..].to_string());
|
||||
}
|
||||
|
||||
if sentences.is_empty() {
|
||||
vec![text.to_string()]
|
||||
} else {
|
||||
sentences
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
// Utility Functions
// ============================================================================

pub fn timer<F, T>(name: &str, f: F) -> Result<T>
where
    F: FnOnce() -> Result<T>,
{
    let start = std::time::Instant::now();
    println!("{}...", name);
    let result = f()?;
    let elapsed = start.elapsed().as_secs_f64();
    println!(" -> {} completed in {:.2} sec", name, elapsed);
    Ok(result)
}

pub fn sanitize_filename(text: &str, max_len: usize) -> String {
    // Take the first max_len characters (Unicode code points, not bytes);
    // is_alphanumeric() covers all Unicode letters and digits.
    text.chars()
        .take(max_len)
        .map(|c| if c.is_alphanumeric() { c } else { '_' })
        .collect()
}

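Because `sanitize_filename` truncates by `chars()` rather than bytes, multi-byte text (e.g. Korean) is never cut mid-code-point. A self-contained copy for illustration (the sample inputs below are made up):

```rust
fn sanitize_filename(text: &str, max_len: usize) -> String {
    // Truncate by Unicode code points (not bytes), then replace anything
    // that is not a Unicode letter or digit with '_'.
    text.chars()
        .take(max_len)
        .map(|c| if c.is_alphanumeric() { c } else { '_' })
        .collect()
}
```

For example, `sanitize_filename("Hello, world!", 20)` yields `Hello__world_` (the comma, space, and exclamation mark each become `_`), and `sanitize_filename("안녕하세요 TTS", 20)` yields `안녕하세요_TTS`, since Hangul syllables count as alphanumeric.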
// ============================================================================
// ONNX Runtime Integration
// ============================================================================

use ort::{
    session::Session,
    value::Value,
};

pub struct Style {
    pub ttl: Array3<f32>,
    pub dp: Array3<f32>,
}

pub struct TextToSpeech {
    cfgs: Config,
    text_processor: UnicodeProcessor,
    dp_ort: Session,
    text_enc_ort: Session,
    vector_est_ort: Session,
    vocoder_ort: Session,
    pub sample_rate: i32,
}

impl TextToSpeech {
    pub fn new(
        cfgs: Config,
        text_processor: UnicodeProcessor,
        dp_ort: Session,
        text_enc_ort: Session,
        vector_est_ort: Session,
        vocoder_ort: Session,
    ) -> Self {
        let sample_rate = cfgs.ae.sample_rate;
        TextToSpeech {
            cfgs,
            text_processor,
            dp_ort,
            text_enc_ort,
            vector_est_ort,
            vocoder_ort,
            sample_rate,
        }
    }

    fn _infer(
        &mut self,
        text_list: &[String],
        lang_list: &[String],
        style: &Style,
        total_step: usize,
        speed: f32,
    ) -> Result<(Vec<f32>, Vec<f32>)> {
        let bsz = text_list.len();

        // Process text
        let (text_ids, text_mask) = self.text_processor.call(text_list, lang_list)?;

        let text_ids_array = {
            let text_ids_shape = (bsz, text_ids[0].len());
            let mut flat = Vec::new();
            for row in &text_ids {
                flat.extend_from_slice(row);
            }
            Array::from_shape_vec(text_ids_shape, flat)?
        };

        let text_ids_value = Value::from_array(text_ids_array)?;
        let text_mask_value = Value::from_array(text_mask.clone())?;
        let style_dp_value = Value::from_array(style.dp.clone())?;

        // Predict duration
        let dp_outputs = self.dp_ort.run(ort::inputs!{
            "text_ids" => &text_ids_value,
            "style_dp" => &style_dp_value,
            "text_mask" => &text_mask_value
        })?;

        let (_, duration_data) = dp_outputs["duration"].try_extract_tensor::<f32>()?;
        let mut duration: Vec<f32> = duration_data.to_vec();

        // Apply speed factor to duration
        for dur in duration.iter_mut() {
            *dur /= speed;
        }

        // Encode text
        let style_ttl_value = Value::from_array(style.ttl.clone())?;
        let text_enc_outputs = self.text_enc_ort.run(ort::inputs!{
            "text_ids" => &text_ids_value,
            "style_ttl" => &style_ttl_value,
            "text_mask" => &text_mask_value
        })?;

        let (text_emb_shape, text_emb_data) = text_enc_outputs["text_emb"].try_extract_tensor::<f32>()?;
        let text_emb = Array3::from_shape_vec(
            (text_emb_shape[0] as usize, text_emb_shape[1] as usize, text_emb_shape[2] as usize),
            text_emb_data.to_vec(),
        )?;

        // Sample noisy latent
        let (mut xt, latent_mask) = sample_noisy_latent(
            &duration,
            self.sample_rate,
            self.cfgs.ae.base_chunk_size,
            self.cfgs.ttl.chunk_compress_factor,
            self.cfgs.ttl.latent_dim,
        );

        // Prepare constant arrays
        let total_step_array = Array::from_elem(bsz, total_step as f32);

        // Denoising loop
        for step in 0..total_step {
            let current_step_array = Array::from_elem(bsz, step as f32);

            let xt_value = Value::from_array(xt.clone())?;
            let text_emb_value = Value::from_array(text_emb.clone())?;
            let latent_mask_value = Value::from_array(latent_mask.clone())?;
            let text_mask_value2 = Value::from_array(text_mask.clone())?;
            let current_step_value = Value::from_array(current_step_array)?;
            let total_step_value = Value::from_array(total_step_array.clone())?;

            let vector_est_outputs = self.vector_est_ort.run(ort::inputs!{
                "noisy_latent" => &xt_value,
                "text_emb" => &text_emb_value,
                "style_ttl" => &style_ttl_value,
                "latent_mask" => &latent_mask_value,
                "text_mask" => &text_mask_value2,
                "current_step" => &current_step_value,
                "total_step" => &total_step_value
            })?;

            let (denoised_shape, denoised_data) = vector_est_outputs["denoised_latent"].try_extract_tensor::<f32>()?;
            xt = Array3::from_shape_vec(
                (denoised_shape[0] as usize, denoised_shape[1] as usize, denoised_shape[2] as usize),
                denoised_data.to_vec(),
            )?;
        }

        // Generate waveform
        let final_latent_value = Value::from_array(xt)?;
        let vocoder_outputs = self.vocoder_ort.run(ort::inputs!{
            "latent" => &final_latent_value
        })?;

        let (_, wav_data) = vocoder_outputs["wav_tts"].try_extract_tensor::<f32>()?;
        let wav: Vec<f32> = wav_data.to_vec();

        Ok((wav, duration))
    }

    pub fn call(
        &mut self,
        text: &str,
        lang: &str,
        style: &Style,
        total_step: usize,
        speed: f32,
        silence_duration: f32,
    ) -> Result<(Vec<f32>, f32)> {
        let max_len = if lang == "ko" { 120 } else { 300 };
        let chunks = chunk_text(text, Some(max_len));

        let mut wav_cat: Vec<f32> = Vec::new();
        let mut dur_cat: f32 = 0.0;

        for (i, chunk) in chunks.iter().enumerate() {
            let (wav, duration) = self._infer(&[chunk.clone()], &[lang.to_string()], style, total_step, speed)?;

            let dur = duration[0];
            let wav_len = (self.sample_rate as f32 * dur) as usize;
            let wav_chunk = &wav[..wav_len.min(wav.len())];

            if i == 0 {
                wav_cat.extend_from_slice(wav_chunk);
                dur_cat = dur;
            } else {
                let silence_len = (silence_duration * self.sample_rate as f32) as usize;
                let silence = vec![0.0f32; silence_len];

                wav_cat.extend_from_slice(&silence);
                wav_cat.extend_from_slice(wav_chunk);
                dur_cat += silence_duration + dur;
            }
        }

        Ok((wav_cat, dur_cat))
    }

    pub fn batch(
        &mut self,
        text_list: &[String],
        lang_list: &[String],
        style: &Style,
        total_step: usize,
        speed: f32,
    ) -> Result<(Vec<f32>, Vec<f32>)> {
        self._infer(text_list, lang_list, style, total_step, speed)
    }
}

// ============================================================================
// Component Loading Functions
// ============================================================================

/// Load voice style from JSON files
pub fn load_voice_style(voice_style_paths: &[String], verbose: bool) -> Result<Style> {
    let bsz = voice_style_paths.len();

    // Read the first file to get dimensions
    let first_file = File::open(&voice_style_paths[0])
        .context("Failed to open voice style file")?;
    let first_reader = BufReader::new(first_file);
    let first_data: VoiceStyleData = serde_json::from_reader(first_reader)?;

    let ttl_dims = &first_data.style_ttl.dims;
    let dp_dims = &first_data.style_dp.dims;

    let ttl_dim1 = ttl_dims[1];
    let ttl_dim2 = ttl_dims[2];
    let dp_dim1 = dp_dims[1];
    let dp_dim2 = dp_dims[2];

    // Pre-allocate arrays with the full batch size
    let ttl_size = bsz * ttl_dim1 * ttl_dim2;
    let dp_size = bsz * dp_dim1 * dp_dim2;
    let mut ttl_flat = vec![0.0f32; ttl_size];
    let mut dp_flat = vec![0.0f32; dp_size];

    // Fill in the data
    for (i, path) in voice_style_paths.iter().enumerate() {
        let file = File::open(path).context("Failed to open voice style file")?;
        let reader = BufReader::new(file);
        let data: VoiceStyleData = serde_json::from_reader(reader)?;

        // Flatten TTL data
        let ttl_offset = i * ttl_dim1 * ttl_dim2;
        let mut idx = 0;
        for batch in &data.style_ttl.data {
            for row in batch {
                for &val in row {
                    ttl_flat[ttl_offset + idx] = val;
                    idx += 1;
                }
            }
        }

        // Flatten DP data
        let dp_offset = i * dp_dim1 * dp_dim2;
        idx = 0;
        for batch in &data.style_dp.data {
            for row in batch {
                for &val in row {
                    dp_flat[dp_offset + idx] = val;
                    idx += 1;
                }
            }
        }
    }

    let ttl_style = Array3::from_shape_vec((bsz, ttl_dim1, ttl_dim2), ttl_flat)?;
    let dp_style = Array3::from_shape_vec((bsz, dp_dim1, dp_dim2), dp_flat)?;

    if verbose {
        println!("Loaded {} voice styles\n", bsz);
    }

    Ok(Style {
        ttl: ttl_style,
        dp: dp_style,
    })
}

/// Load TTS components
pub fn load_text_to_speech(onnx_dir: &str, use_gpu: bool) -> Result<TextToSpeech> {
    if use_gpu {
        anyhow::bail!("GPU mode is not supported yet");
    }
    println!("Using CPU for inference\n");

    let cfgs = load_cfgs(onnx_dir)?;

    let dp_path = format!("{}/duration_predictor.onnx", onnx_dir);
    let text_enc_path = format!("{}/text_encoder.onnx", onnx_dir);
    let vector_est_path = format!("{}/vector_estimator.onnx", onnx_dir);
    let vocoder_path = format!("{}/vocoder.onnx", onnx_dir);

    let dp_ort = Session::builder()?
        .commit_from_file(&dp_path)?;
    let text_enc_ort = Session::builder()?
        .commit_from_file(&text_enc_path)?;
    let vector_est_ort = Session::builder()?
        .commit_from_file(&vector_est_path)?;
    let vocoder_ort = Session::builder()?
        .commit_from_file(&vocoder_path)?;

    let unicode_indexer_path = format!("{}/unicode_indexer.json", onnx_dir);
    let text_processor = UnicodeProcessor::new(&unicode_indexer_path)?;

    Ok(TextToSpeech::new(
        cfgs,
        text_processor,
        dp_ort,
        text_enc_ort,
        vector_est_ort,
        vocoder_ort,
    ))
}
15
swift/.gitignore
vendored
Normal file
@@ -0,0 +1,15 @@
# Swift Package Manager
.build/
.swiftpm/
*.xcodeproj
*.xcworkspace

# Build artifacts
example_onnx

# Results
results/*.wav

# macOS
.DS_Store
14
swift/Package.resolved
Normal file
@@ -0,0 +1,14 @@
{
  "pins" : [
    {
      "identity" : "onnxruntime-swift-package-manager",
      "kind" : "remoteSourceControl",
      "location" : "https://github.com/microsoft/onnxruntime-swift-package-manager.git",
      "state" : {
        "revision" : "12ce7374c86944e1f68f3a866d10105d8357f074",
        "version" : "1.20.0"
      }
    }
  ],
  "version" : 2
}
22
swift/Package.swift
Normal file
@@ -0,0 +1,22 @@
// swift-tools-version: 5.9
import PackageDescription

let package = Package(
    name: "Supertonic",
    platforms: [
        .macOS(.v13)
    ],
    dependencies: [
        .package(url: "https://github.com/microsoft/onnxruntime-swift-package-manager.git", from: "1.16.0"),
    ],
    targets: [
        .executableTarget(
            name: "example_onnx",
            dependencies: [
                .product(name: "onnxruntime", package: "onnxruntime-swift-package-manager")
            ],
            path: "Sources"
        )
    ]
)
122
swift/README.md
Normal file
@@ -0,0 +1,122 @@
# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using `example_onnx`.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)

**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.

**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).

**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.

## Installation

This project uses Swift Package Manager (SPM) for dependency management.

### Prerequisites
- Swift 5.9 or later
- macOS 13.0 or later

### Build the project
```bash
swift build -c release
```

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
.build/release/example_onnx
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4

### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
.build/release/example_onnx \
    --batch \
    --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
    --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
    --lang en,ko
```

This will:
- Generate speech for 2 different voice-text-language triplets
- Use the male voice (M1.json) for the first, English text
- Use the female voice (F1.json) for the second, Korean text
- Process both samples in a single batch

### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
.build/release/example_onnx \
    --total-step 10 \
    --voice-style assets/voice_styles/M1.json \
    --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference

### Example 4: Long-Form Inference
The system automatically chunks long texts into manageable segments, synthesizes each segment separately, and concatenates them with natural pauses (0.3 seconds by default) into a single audio file. This happens by default when you don't use the `--batch` flag:

```bash
.build/release/example_onnx \
    --voice-style assets/voice_styles/M1.json \
    --text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
```

This will:
- Automatically split the text into chunks based on paragraph and sentence boundaries
- Synthesize each chunk separately
- Add 0.3 seconds of silence between chunks for natural pauses
- Concatenate all chunks into a single audio file

**Note**: Automatic text chunking is disabled when using `--batch` mode. In batch mode, each text is processed as-is without chunking.

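Internally, the concatenation is simple: each chunk's waveform is appended to a running buffer, with a block of zero samples between chunks. A language-neutral sketch (shown here in Rust; the function name and inputs are illustrative, not part of the CLI):

```rust
/// Sketch of long-form concatenation: join per-chunk waveforms with
/// `silence_duration` seconds of zero samples between them.
fn concat_with_silence(chunks: &[Vec<f32>], sample_rate: u32, silence_duration: f32) -> Vec<f32> {
    let silence_len = (silence_duration * sample_rate as f32) as usize;
    let mut out = Vec::new();
    for (i, chunk) in chunks.iter().enumerate() {
        if i > 0 {
            // Insert the pause before every chunk except the first
            out.extend(std::iter::repeat(0.0f32).take(silence_len));
        }
        out.extend_from_slice(chunk);
    }
    out
}
```

For example, at a 44.1 kHz sample rate the default 0.3 s pause corresponds to 13,230 zero samples between chunks.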
## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech synthesis speed (recommended range: 0.9-1.5) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) |
| `--text` | str+ | (long default text) | Text(s) to synthesize |
| `--lang` | str+ | `en` | Language(s) for synthesis (en, ko, es, pt, fr) |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (multiple text-style-lang triplets; disables automatic chunking) |

## Multilingual Support

Supertonic 2 supports multiple languages. Use the `--lang` argument to specify the language:

- `en` - English (default)
- `ko` - Korean (한국어)
- `es` - Spanish (Español)
- `pt` - Portuguese (Português)
- `fr` - French (Français)

## Notes

- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3 s pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
163
swift/Sources/ExampleONNX.swift
Normal file
@@ -0,0 +1,163 @@
import Foundation
import OnnxRuntimeBindings

struct Args {
    var useGpu: Bool = false
    var onnxDir: String = "assets/onnx"
    var totalStep: Int = 5
    var speed: Float = 1.05
    var nTest: Int = 4
    var voiceStyle: [String] = ["assets/voice_styles/M1.json"]
    var text: [String] = ["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."]
    var lang: [String] = ["en"]
    var saveDir: String = "results"
    var batch: Bool = false
}

func parseArgs() -> Args {
    var args = Args()
    let arguments = CommandLine.arguments

    var i = 1
    while i < arguments.count {
        let arg = arguments[i]

        switch arg {
        case "--use-gpu":
            args.useGpu = true
        case "--onnx-dir":
            if i + 1 < arguments.count {
                args.onnxDir = arguments[i + 1]
                i += 1
            }
        case "--total-step":
            if i + 1 < arguments.count {
                args.totalStep = Int(arguments[i + 1]) ?? 5
                i += 1
            }
        case "--speed":
            if i + 1 < arguments.count {
                args.speed = Float(arguments[i + 1]) ?? 1.05
                i += 1
            }
        case "--n-test":
            if i + 1 < arguments.count {
                args.nTest = Int(arguments[i + 1]) ?? 4
                i += 1
            }
        case "--voice-style":
            if i + 1 < arguments.count {
                args.voiceStyle = arguments[i + 1].components(separatedBy: ",")
                i += 1
            }
        case "--text":
            if i + 1 < arguments.count {
                args.text = arguments[i + 1].components(separatedBy: "|")
                i += 1
            }
        case "--lang":
            if i + 1 < arguments.count {
                args.lang = arguments[i + 1].components(separatedBy: ",")
                i += 1
            }
        case "--save-dir":
            if i + 1 < arguments.count {
                args.saveDir = arguments[i + 1]
                i += 1
            }
        case "--batch":
            args.batch = true
        default:
            break
        }

        i += 1
    }

    return args
}

@main
struct ExampleONNX {
    static func main() async {
        print("=== TTS Inference with ONNX Runtime (Swift) ===\n")

        // --- 1. Parse arguments --- //
        let args = parseArgs()

        if args.batch {
            guard args.voiceStyle.count == args.text.count else {
                print("Error: Number of voice styles (\(args.voiceStyle.count)) must match number of texts (\(args.text.count))")
                return
            }
            guard args.lang.count == args.text.count else {
                print("Error: Number of languages (\(args.lang.count)) must match number of texts (\(args.text.count))")
                return
            }
        }

        let bsz = args.voiceStyle.count

        do {
            let env = try ORTEnv(loggingLevel: .warning)

            // --- 2. Load TTS components --- //
            let textToSpeech = try loadTextToSpeech(args.onnxDir, args.useGpu, env)

            // --- 3. Load voice styles --- //
            let style = try loadVoiceStyle(args.voiceStyle, verbose: true)

            // --- 4. Synthesize speech --- //
            try? FileManager.default.createDirectory(atPath: args.saveDir, withIntermediateDirectories: true)

            for n in 0..<args.nTest {
                print("\n[\(n + 1)/\(args.nTest)] Starting synthesis...")

                let wav: [Float]
                let duration: [Float]

                if args.batch {
                    let result = try timer("Generating speech from text") {
                        try textToSpeech.batch(args.text, args.lang, style, args.totalStep, speed: args.speed)
                    }
                    wav = result.wav
                    duration = result.duration
                } else {
                    let result = try timer("Generating speech from text") {
                        try textToSpeech.call(args.text[0], args.lang[0], style, args.totalStep, speed: args.speed, silenceDuration: 0.3)
                    }
                    wav = result.wav
                    duration = [result.duration]
                }

                // Save outputs
                for i in 0..<bsz {
                    let fname = "\(sanitizeFilename(args.text[i], maxLen: 20))_\(n + 1).wav"
                    let wavOut: [Float]

                    if args.batch {
                        let wavLen = wav.count / bsz
                        let actualLen = Int(Float(textToSpeech.sampleRate) * duration[i])
                        let wavStart = i * wavLen
                        let wavEnd = min(wavStart + actualLen, wavStart + wavLen)
                        wavOut = Array(wav[wavStart..<wavEnd])
                    } else {
                        // In non-batch mode, wav is a single concatenated audio clip
                        let actualLen = Int(Float(textToSpeech.sampleRate) * duration[0])
                        wavOut = Array(wav.prefix(actualLen))
                    }

                    let outputPath = "\(args.saveDir)/\(fname)"
                    try writeWavFile(outputPath, wavOut, textToSpeech.sampleRate)
                    print("Saved: \(outputPath)")
                }
            }

            print("\n=== Synthesis completed successfully! ===")

        } catch {
            print("Error during inference: \(error)")
            exit(1)
        }
    }
}
835
swift/Sources/Helper.swift
Normal file
@@ -0,0 +1,835 @@
import Foundation
import Accelerate
import OnnxRuntimeBindings

// MARK: - Available Languages

let AVAILABLE_LANGS = ["en", "ko", "es", "pt", "fr"]

func isValidLang(_ lang: String) -> Bool {
    return AVAILABLE_LANGS.contains(lang)
}

// MARK: - Configuration Structures

struct Config: Codable {
    struct AEConfig: Codable {
        let sample_rate: Int
        let base_chunk_size: Int
    }

    struct TTLConfig: Codable {
        let chunk_compress_factor: Int
        let latent_dim: Int
    }

    let ae: AEConfig
    let ttl: TTLConfig
}

// MARK: - Voice Style Data Structure

struct VoiceStyleData: Codable {
    struct StyleComponent: Codable {
        let data: [[[Float]]]
        let dims: [Int]
        let type: String
    }

    let style_ttl: StyleComponent
    let style_dp: StyleComponent
}

// MARK: - Unicode Text Processor

class UnicodeProcessor {
    let indexer: [Int64]

    init(unicodeIndexerPath: String) throws {
        let data = try Data(contentsOf: URL(fileURLWithPath: unicodeIndexerPath))
        self.indexer = try JSONDecoder().decode([Int64].self, from: data)
    }

    func call(_ textList: [String], _ langList: [String]) -> (textIds: [[Int64]], textMask: [[[Float]]]) {
        var processedTexts = [String]()
        for (i, text) in textList.enumerated() {
            processedTexts.append(preprocessText(text, lang: langList[i]))
        }

        // Use unicodeScalars.count for the correct length after NFKD decomposition
        var textIdsLengths = [Int]()
        for text in processedTexts {
            textIdsLengths.append(text.unicodeScalars.count)
        }

        let maxLen = textIdsLengths.max() ?? 0

        var textIds = [[Int64]]()
        for text in processedTexts {
            var row = Array(repeating: Int64(0), count: maxLen)
            let unicodeValues = Array(text.unicodeScalars.map { Int($0.value) })
            for (j, val) in unicodeValues.enumerated() {
                if val < indexer.count {
                    row[j] = indexer[val]
                } else {
                    row[j] = -1
                }
            }
            textIds.append(row)
        }

        let textMask = getTextMask(textIdsLengths)
        return (textIds, textMask)
    }
}

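The indexer is a flat lookup table: each Unicode scalar value indexes directly into the array, and any scalar past the end of the table maps to -1 (unknown). A minimal sketch of the same lookup in Rust (the function name and toy table below are illustrative):

```rust
/// Map each character's Unicode scalar value through a lookup table;
/// out-of-range scalars become -1 (unknown token).
fn index_text(text: &str, indexer: &[i64]) -> Vec<i64> {
    text.chars()
        .map(|c| {
            let cp = c as usize;
            if cp < indexer.len() { indexer[cp] } else { -1 }
        })
        .collect()
}
```

With a toy table covering only code points 0-127, any ASCII character resolves through the table while accented or non-Latin characters fall through to -1.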
func preprocessText(_ text: String, lang: String) -> String {
    // Use NFKD (decomposed) for proper Hangul Jamo decomposition
    var text = text.decomposedStringWithCompatibilityMapping

    // Remove emojis (wide Unicode ranges).
    // Swift's NSRegularExpression doesn't support Unicode escapes above \uFFFF,
    // so filter scalars by numeric range instead.
    text = text.unicodeScalars.filter { scalar in
        let value = scalar.value
        return !((value >= 0x1F600 && value <= 0x1F64F) ||
                 (value >= 0x1F300 && value <= 0x1F5FF) ||
                 (value >= 0x1F680 && value <= 0x1F6FF) ||
                 (value >= 0x1F700 && value <= 0x1F77F) ||
                 (value >= 0x1F780 && value <= 0x1F7FF) ||
                 (value >= 0x1F800 && value <= 0x1F8FF) ||
                 (value >= 0x1F900 && value <= 0x1F9FF) ||
                 (value >= 0x1FA00 && value <= 0x1FA6F) ||
                 (value >= 0x1FA70 && value <= 0x1FAFF) ||
                 (value >= 0x2600 && value <= 0x26FF) ||
                 (value >= 0x2700 && value <= 0x27BF) ||
                 (value >= 0x1F1E6 && value <= 0x1F1FF))
    }.map { String($0) }.joined()

    // Replace various dashes and symbols
    let replacements: [String: String] = [
        "–": "-",          // en dash
        "‑": "-",          // non-breaking hyphen
        "—": "-",          // em dash
        "_": " ",          // underscore
        "\u{201C}": "\"",  // left double quote
        "\u{201D}": "\"",  // right double quote
        "\u{2018}": "'",   // left single quote
        "\u{2019}": "'",   // right single quote
        "´": "'",          // acute accent
        "`": "'",          // grave accent
        "[": " ",          // left bracket
        "]": " ",          // right bracket
        "|": " ",          // vertical bar
        "/": " ",          // slash
        "#": " ",          // hash
        "→": " ",          // right arrow
        "←": " ",          // left arrow
    ]

    for (old, new) in replacements {
        text = text.replacingOccurrences(of: old, with: new)
    }

    // Remove special symbols
    let specialSymbols = ["♥", "☆", "♡", "©", "\\"]
    for symbol in specialSymbols {
        text = text.replacingOccurrences(of: symbol, with: "")
    }

    // Replace known expressions
    let exprReplacements: [String: String] = [
        "@": " at ",
        "e.g.,": "for example, ",
        "i.e.,": "that is, ",
    ]

    for (old, new) in exprReplacements {
        text = text.replacingOccurrences(of: old, with: new)
    }

    // Fix spacing around punctuation
    text = text.replacingOccurrences(of: " ,", with: ",")
    text = text.replacingOccurrences(of: " .", with: ".")
    text = text.replacingOccurrences(of: " !", with: "!")
    text = text.replacingOccurrences(of: " ?", with: "?")
    text = text.replacingOccurrences(of: " ;", with: ";")
    text = text.replacingOccurrences(of: " :", with: ":")
    text = text.replacingOccurrences(of: " '", with: "'")

    // Collapse duplicate quotes
    while text.contains("\"\"") {
        text = text.replacingOccurrences(of: "\"\"", with: "\"")
    }
    while text.contains("''") {
        text = text.replacingOccurrences(of: "''", with: "'")
    }
    while text.contains("``") {
        text = text.replacingOccurrences(of: "``", with: "`")
    }

    // Collapse extra whitespace
    let whitespacePattern = try! NSRegularExpression(pattern: "\\s+")
    let whitespaceRange = NSRange(text.startIndex..., in: text)
    text = whitespacePattern.stringByReplacingMatches(in: text, range: whitespaceRange, withTemplate: " ")
    text = text.trimmingCharacters(in: .whitespacesAndNewlines)

    // If the text doesn't end with punctuation, quotes, or closing brackets, add a period
    if !text.isEmpty {
        let punctPattern = try! NSRegularExpression(pattern: "[.!?;:,'\"\\u201C\\u201D\\u2018\\u2019)\\]}…。」』】〉》›»]$")
        let punctRange = NSRange(text.startIndex..., in: text)
        if punctPattern.firstMatch(in: text, range: punctRange) == nil {
            text += "."
        }
    }

    // Validate language
    guard isValidLang(lang) else {
        fatalError("Invalid language: \(lang). Available: \(AVAILABLE_LANGS.joined(separator: ", "))")
    }

    // Wrap text with language tags
    text = "<\(lang)>\(text)</\(lang)>"

    return text
}

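The same range test is easy to express in any language with direct access to code points. A Rust sketch of the emoji filter above (ranges copied from the function, with adjacent blocks merged; the helper names are illustrative):

```rust
/// True if the scalar falls in one of the emoji / symbol blocks
/// stripped by the preprocessor. Adjacent ranges from the original
/// list (0x1F780-0x1F8FF, 0x1FA00-0x1FAFF) are merged here.
fn is_emoji(c: char) -> bool {
    let v = c as u32;
    (0x1F300..=0x1F5FF).contains(&v)
        || (0x1F600..=0x1F64F).contains(&v)
        || (0x1F680..=0x1F6FF).contains(&v)
        || (0x1F700..=0x1F77F).contains(&v)
        || (0x1F780..=0x1F8FF).contains(&v)
        || (0x1F900..=0x1F9FF).contains(&v)
        || (0x1FA00..=0x1FAFF).contains(&v)
        || (0x2600..=0x26FF).contains(&v)
        || (0x2700..=0x27BF).contains(&v)
        || (0x1F1E6..=0x1F1FF).contains(&v)
}

fn strip_emoji(text: &str) -> String {
    text.chars().filter(|c| !is_emoji(*c)).collect()
}
```

Note that only the emoji scalar is removed; surrounding spaces are left for the later whitespace-collapsing pass to clean up.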
func lengthToMask(_ lengths: [Int], maxLen: Int? = nil) -> [[[Float]]] {
    let actualMaxLen = maxLen ?? (lengths.max() ?? 0)

    var mask = [[[Float]]]()
    for len in lengths {
        var row = Array(repeating: Float(0.0), count: actualMaxLen)
        for j in 0..<min(len, actualMaxLen) {
            row[j] = 1.0
        }
        mask.append([row])
    }
    return mask
}

func getTextMask(_ textIdsLengths: [Int]) -> [[[Float]]] {
    let maxLen = textIdsLengths.max() ?? 0
    return lengthToMask(textIdsLengths, maxLen: maxLen)
}

func sampleNoisyLatent(duration: [Float], sampleRate: Int, baseChunkSize: Int, chunkCompress: Int, latentDim: Int) -> (noisyLatent: [[[Float]]], latentMask: [[[Float]]]) {
    let bsz = duration.count
    let maxDur = duration.max() ?? 0.0

    let wavLenMax = Int(maxDur * Float(sampleRate))
    var wavLengths = [Int]()
    for d in duration {
        wavLengths.append(Int(d * Float(sampleRate)))
    }

    let chunkSize = baseChunkSize * chunkCompress
    let latentLen = (wavLenMax + chunkSize - 1) / chunkSize
    let latentDimVal = latentDim * chunkCompress

    var noisyLatent = [[[Float]]]()
    for _ in 0..<bsz {
        var batch = [[Float]]()
        for _ in 0..<latentDimVal {
            var row = [Float]()
            for _ in 0..<latentLen {
                // Box-Muller transform: two uniform samples -> one standard normal sample
                let u1 = Float.random(in: 0.0001...1.0)
                let u2 = Float.random(in: 0.0...1.0)
                let val = sqrt(-2.0 * log(u1)) * cos(2.0 * Float.pi * u2)
                row.append(val)
            }
            batch.append(row)
        }
        noisyLatent.append(batch)
    }

    var latentLengths = [Int]()
    for len in wavLengths {
        latentLengths.append((len + chunkSize - 1) / chunkSize)
    }

    let latentMask = lengthToMask(latentLengths, maxLen: latentLen)

    // Zero out positions beyond each sample's latent length
    for b in 0..<bsz {
        for d in 0..<latentDimVal {
            for t in 0..<latentLen {
                noisyLatent[b][d][t] *= latentMask[b][0][t]
            }
        }
    }

    return (noisyLatent, latentMask)
}

func getLatentMask(_ wavLengths: [Int64], _ cfgs: Config) -> [[[Float]]] {
|
||||
let baseChunkSize = cfgs.ae.base_chunk_size
|
||||
let chunkCompressFactor = cfgs.ttl.chunk_compress_factor
|
||||
let latentSize = baseChunkSize * chunkCompressFactor
|
||||
|
||||
var latentLengths = [Int]()
|
||||
for len in wavLengths {
|
||||
latentLengths.append((Int(len) + latentSize - 1) / latentSize)
|
||||
}
|
||||
|
||||
let maxLen = latentLengths.max() ?? 0
|
||||
return lengthToMask(latentLengths, maxLen: maxLen)
|
||||
}
|
||||
|
||||

// MARK: - WAV File I/O

func writeWavFile(_ filename: String, _ audioData: [Float], _ sampleRate: Int) throws {
    let url = URL(fileURLWithPath: filename)

    // Convert float to int16
    let int16Data = audioData.map { sample -> Int16 in
        let clamped = max(-1.0, min(1.0, sample))
        return Int16(clamped * 32767.0)
    }

    // Create WAV header
    let numChannels: UInt16 = 1
    let bitsPerSample: UInt16 = 16
    let byteRate = UInt32(sampleRate) * UInt32(numChannels) * UInt32(bitsPerSample) / 8
    let blockAlign = numChannels * bitsPerSample / 8
    let dataSize = UInt32(int16Data.count * 2)

    var data = Data()

    // RIFF chunk
    data.append("RIFF".data(using: .ascii)!)
    withUnsafeBytes(of: UInt32(36 + dataSize).littleEndian) { data.append(contentsOf: $0) }
    data.append("WAVE".data(using: .ascii)!)

    // fmt chunk
    data.append("fmt ".data(using: .ascii)!)
    withUnsafeBytes(of: UInt32(16).littleEndian) { data.append(contentsOf: $0) }
    withUnsafeBytes(of: UInt16(1).littleEndian) { data.append(contentsOf: $0) } // PCM
    withUnsafeBytes(of: numChannels.littleEndian) { data.append(contentsOf: $0) }
    withUnsafeBytes(of: UInt32(sampleRate).littleEndian) { data.append(contentsOf: $0) }
    withUnsafeBytes(of: byteRate.littleEndian) { data.append(contentsOf: $0) }
    withUnsafeBytes(of: blockAlign.littleEndian) { data.append(contentsOf: $0) }
    withUnsafeBytes(of: bitsPerSample.littleEndian) { data.append(contentsOf: $0) }

    // data chunk
    data.append("data".data(using: .ascii)!)
    withUnsafeBytes(of: dataSize.littleEndian) { data.append(contentsOf: $0) }

    // audio data
    int16Data.withUnsafeBytes { data.append(contentsOf: $0) }

    try data.write(to: url)
}

// MARK: - Text Chunking

let MAX_CHUNK_LENGTH = 300
let ABBREVIATIONS = [
    "Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
    "St.", "Ave.", "Rd.", "Blvd.", "Dept.", "Inc.", "Ltd.",
    "Co.", "Corp.", "etc.", "vs.", "i.e.", "e.g.", "Ph.D."
]

func chunkText(_ text: String, maxLen: Int = 0) -> [String] {
    let actualMaxLen = maxLen > 0 ? maxLen : MAX_CHUNK_LENGTH
    let trimmedText = text.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines)

    if trimmedText.isEmpty {
        return [""]
    }

    // Split by paragraphs using regex
    let paraPattern = try! NSRegularExpression(pattern: "\\n\\s*\\n")
    let paraRange = NSRange(trimmedText.startIndex..., in: trimmedText)
    var paragraphs = [String]()
    var lastEnd = trimmedText.startIndex

    paraPattern.enumerateMatches(in: trimmedText, range: paraRange) { match, _, _ in
        if let match = match, let range = Range(match.range, in: trimmedText) {
            paragraphs.append(String(trimmedText[lastEnd..<range.lowerBound]))
            lastEnd = range.upperBound
        }
    }
    if lastEnd < trimmedText.endIndex {
        paragraphs.append(String(trimmedText[lastEnd...]))
    }
    if paragraphs.isEmpty {
        paragraphs = [trimmedText]
    }

    var chunks = [String]()

    for para in paragraphs {
        let trimmedPara = para.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines)
        if trimmedPara.isEmpty {
            continue
        }

        if trimmedPara.count <= actualMaxLen {
            chunks.append(trimmedPara)
            continue
        }

        // Split by sentences
        let sentences = splitSentences(trimmedPara)
        var current = ""
        var currentLen = 0

        for sentence in sentences {
            let trimmedSentence = sentence.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines)
            if trimmedSentence.isEmpty {
                continue
            }

            let sentenceLen = trimmedSentence.count
            if sentenceLen > actualMaxLen {
                // If sentence is longer than maxLen, split by comma or space
                if !current.isEmpty {
                    chunks.append(current.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
                    current = ""
                    currentLen = 0
                }

                // Try splitting by comma
                let parts = trimmedSentence.components(separatedBy: ",")
                for part in parts {
                    let trimmedPart = part.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines)
                    if trimmedPart.isEmpty {
                        continue
                    }

                    let partLen = trimmedPart.count
                    if partLen > actualMaxLen {
                        // Split by space as a last resort
                        let words = trimmedPart.components(separatedBy: CharacterSet.whitespaces).filter { !$0.isEmpty }
                        var wordChunk = ""
                        var wordChunkLen = 0

                        for word in words {
                            let wordLen = word.count
                            if wordChunkLen + wordLen + 1 > actualMaxLen && !wordChunk.isEmpty {
                                chunks.append(wordChunk.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
                                wordChunk = ""
                                wordChunkLen = 0
                            }

                            if !wordChunk.isEmpty {
                                wordChunk += " "
                                wordChunkLen += 1
                            }
                            wordChunk += word
                            wordChunkLen += wordLen
                        }

                        if !wordChunk.isEmpty {
                            chunks.append(wordChunk.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
                        }
                    } else {
                        if currentLen + partLen + 1 > actualMaxLen && !current.isEmpty {
                            chunks.append(current.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
                            current = ""
                            currentLen = 0
                        }

                        if !current.isEmpty {
                            current += ", "
                            currentLen += 2
                        }
                        current += trimmedPart
                        currentLen += partLen
                    }
                }
                continue
            }

            if currentLen + sentenceLen + 1 > actualMaxLen && !current.isEmpty {
                chunks.append(current.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
                current = ""
                currentLen = 0
            }

            if !current.isEmpty {
                current += " "
                currentLen += 1
            }
            current += trimmedSentence
            currentLen += sentenceLen
        }

        if !current.isEmpty {
            chunks.append(current.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
        }
    }

    return chunks.isEmpty ? [""] : chunks
}

func splitSentences(_ text: String) -> [String] {
    // Swift's regex doesn't support lookbehind reliably, so we use a simpler approach:
    // split on sentence boundaries and then check whether they end in abbreviations
    let regex = try! NSRegularExpression(pattern: "([.!?])\\s+")
    let range = NSRange(text.startIndex..., in: text)

    // Find all matches
    let matches = regex.matches(in: text, range: range)
    if matches.isEmpty {
        return [text]
    }

    var sentences = [String]()
    var lastEnd = text.startIndex

    for match in matches {
        guard let matchRange = Range(match.range, in: text) else { continue }

        // Get the text before the punctuation
        let beforePunc = String(text[lastEnd..<matchRange.lowerBound])

        // Get the punctuation character
        let puncRange = Range(NSRange(location: match.range.location, length: 1), in: text)!
        let punc = String(text[puncRange])

        // Check if this ends with an abbreviation
        var isAbbrev = false
        let combined = beforePunc.trimmingCharacters(in: CharacterSet.whitespaces) + punc
        for abbrev in ABBREVIATIONS {
            if combined.hasSuffix(abbrev) {
                isAbbrev = true
                break
            }
        }

        if !isAbbrev {
            // This is a real sentence boundary
            sentences.append(String(text[lastEnd..<matchRange.upperBound]))
            lastEnd = matchRange.upperBound
        }
    }

    // Add the remaining text
    if lastEnd < text.endIndex {
        sentences.append(String(text[lastEnd...]))
    }

    return sentences.isEmpty ? [text] : sentences
}

// MARK: - Utility Functions

func timer<T>(_ name: String, _ f: () throws -> T) rethrows -> T {
    let start = Date()
    print("\(name)...")
    let result = try f()
    let elapsed = Date().timeIntervalSince(start)
    print(String(format: " -> %@ completed in %.2f sec", name, elapsed))
    return result
}

func sanitizeFilename(_ text: String, maxLen: Int) -> String {
    let truncated = text.count > maxLen ? String(text.prefix(maxLen)) : text
    return truncated.map { char -> Character in
        if char.isLetter || char.isNumber {
            return char
        } else {
            return Character("_")
        }
    }.map(String.init).joined()
}

func loadCfgs(_ onnxDir: String) throws -> Config {
    let cfgPath = "\(onnxDir)/tts.json"
    let data = try Data(contentsOf: URL(fileURLWithPath: cfgPath))
    let config = try JSONDecoder().decode(Config.self, from: data)
    return config
}

// MARK: - ONNX Runtime Integration

struct Style {
    let ttl: ORTValue
    let dp: ORTValue
}

class TextToSpeech {
    let cfgs: Config
    let textProcessor: UnicodeProcessor
    let dpOrt: ORTSession
    let textEncOrt: ORTSession
    let vectorEstOrt: ORTSession
    let vocoderOrt: ORTSession
    let sampleRate: Int

    init(cfgs: Config, textProcessor: UnicodeProcessor,
         dpOrt: ORTSession, textEncOrt: ORTSession,
         vectorEstOrt: ORTSession, vocoderOrt: ORTSession) {
        self.cfgs = cfgs
        self.textProcessor = textProcessor
        self.dpOrt = dpOrt
        self.textEncOrt = textEncOrt
        self.vectorEstOrt = vectorEstOrt
        self.vocoderOrt = vocoderOrt
        self.sampleRate = cfgs.ae.sample_rate
    }

    private func _infer(_ textList: [String], _ langList: [String], _ style: Style, _ totalStep: Int, speed: Float = 1.05) throws -> (wav: [Float], duration: [Float]) {
        let bsz = textList.count

        // Process text
        let (textIds, textMask) = textProcessor.call(textList, langList)

        // Flatten text IDs
        let textIdsFlat = textIds.flatMap { $0 }
        let textIdsShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: textIds[0].count)]
        let textIdsValue = try ORTValue(tensorData: NSMutableData(bytes: textIdsFlat, length: textIdsFlat.count * MemoryLayout<Int64>.size),
                                        elementType: .int64,
                                        shape: textIdsShape)

        // Flatten text mask
        let textMaskFlat = textMask.flatMap { $0.flatMap { $0 } }
        let textMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: textMask[0][0].count)]
        let textMaskValue = try ORTValue(tensorData: NSMutableData(bytes: textMaskFlat, length: textMaskFlat.count * MemoryLayout<Float>.size),
                                        elementType: .float,
                                        shape: textMaskShape)

        // Predict duration
        let dpOutputs = try dpOrt.run(withInputs: ["text_ids": textIdsValue, "style_dp": style.dp, "text_mask": textMaskValue],
                                      outputNames: ["duration"],
                                      runOptions: nil)

        let durationData = try dpOutputs["duration"]!.tensorData() as Data
        var duration = durationData.withUnsafeBytes { ptr in
            Array(ptr.bindMemory(to: Float.self))
        }

        // Apply speed factor to duration
        for i in 0..<duration.count {
            duration[i] /= speed
        }

        // Encode text
        let textEncOutputs = try textEncOrt.run(withInputs: ["text_ids": textIdsValue, "style_ttl": style.ttl, "text_mask": textMaskValue],
                                                outputNames: ["text_emb"],
                                                runOptions: nil)

        let textEmbValue = textEncOutputs["text_emb"]!

        // Sample noisy latent
        var (xt, latentMask) = sampleNoisyLatent(duration: duration, sampleRate: sampleRate,
                                                 baseChunkSize: cfgs.ae.base_chunk_size,
                                                 chunkCompress: cfgs.ttl.chunk_compress_factor,
                                                 latentDim: cfgs.ttl.latent_dim)

        // Prepare constant arrays
        let totalStepArray = Array(repeating: Float(totalStep), count: bsz)
        let totalStepValue = try ORTValue(tensorData: NSMutableData(bytes: totalStepArray, length: totalStepArray.count * MemoryLayout<Float>.size),
                                          elementType: .float,
                                          shape: [NSNumber(value: bsz)])

        // Denoising loop
        for step in 0..<totalStep {
            let currentStepArray = Array(repeating: Float(step), count: bsz)
            let currentStepValue = try ORTValue(tensorData: NSMutableData(bytes: currentStepArray, length: currentStepArray.count * MemoryLayout<Float>.size),
                                                elementType: .float,
                                                shape: [NSNumber(value: bsz)])

            // Flatten xt
            let xtFlat = xt.flatMap { $0.flatMap { $0 } }
            let xtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)]
            let xtValue = try ORTValue(tensorData: NSMutableData(bytes: xtFlat, length: xtFlat.count * MemoryLayout<Float>.size),
                                       elementType: .float,
                                       shape: xtShape)

            // Flatten latent mask
            let latentMaskFlat = latentMask.flatMap { $0.flatMap { $0 } }
            let latentMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: latentMask[0][0].count)]
            let latentMaskValue = try ORTValue(tensorData: NSMutableData(bytes: latentMaskFlat, length: latentMaskFlat.count * MemoryLayout<Float>.size),
                                               elementType: .float,
                                               shape: latentMaskShape)

            let vectorEstOutputs = try vectorEstOrt.run(withInputs: [
                "noisy_latent": xtValue,
                "text_emb": textEmbValue,
                "style_ttl": style.ttl,
                "latent_mask": latentMaskValue,
                "text_mask": textMaskValue,
                "current_step": currentStepValue,
                "total_step": totalStepValue
            ], outputNames: ["denoised_latent"], runOptions: nil)

            let denoisedData = try vectorEstOutputs["denoised_latent"]!.tensorData() as Data
            let denoisedFlat = denoisedData.withUnsafeBytes { ptr in
                Array(ptr.bindMemory(to: Float.self))
            }

            // Reshape to 3D
            let latentDimVal = xt[0].count
            let latentLen = xt[0][0].count
            xt = []
            var idx = 0
            for _ in 0..<bsz {
                var batch = [[Float]]()
                for _ in 0..<latentDimVal {
                    var row = [Float]()
                    for _ in 0..<latentLen {
                        row.append(denoisedFlat[idx])
                        idx += 1
                    }
                    batch.append(row)
                }
                xt.append(batch)
            }
        }

        // Generate waveform
        let finalXtFlat = xt.flatMap { $0.flatMap { $0 } }
        let finalXtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)]
        let finalXtValue = try ORTValue(tensorData: NSMutableData(bytes: finalXtFlat, length: finalXtFlat.count * MemoryLayout<Float>.size),
                                        elementType: .float,
                                        shape: finalXtShape)

        let vocoderOutputs = try vocoderOrt.run(withInputs: ["latent": finalXtValue],
                                                outputNames: ["wav_tts"],
                                                runOptions: nil)

        let wavData = try vocoderOutputs["wav_tts"]!.tensorData() as Data
        let wav = wavData.withUnsafeBytes { ptr in
            Array(ptr.bindMemory(to: Float.self))
        }

        return (wav, duration)
    }

    func call(_ text: String, _ lang: String, _ style: Style, _ totalStep: Int, speed: Float = 1.05, silenceDuration: Float = 0.3) throws -> (wav: [Float], duration: Float) {
        let maxLen = lang == "ko" ? 120 : 300
        let chunks = chunkText(text, maxLen: maxLen)
        let langList = Array(repeating: lang, count: chunks.count)

        var wavCat = [Float]()
        var durCat: Float = 0.0

        for (i, chunk) in chunks.enumerated() {
            let result = try _infer([chunk], [langList[i]], style, totalStep, speed: speed)

            let dur = result.duration[0]
            let wavLen = Int(Float(sampleRate) * dur)
            let wavChunk = Array(result.wav.prefix(wavLen))

            if i == 0 {
                wavCat = wavChunk
                durCat = dur
            } else {
                let silenceLen = Int(silenceDuration * Float(sampleRate))
                let silence = [Float](repeating: 0.0, count: silenceLen)

                wavCat.append(contentsOf: silence)
                wavCat.append(contentsOf: wavChunk)
                durCat += silenceDuration + dur
            }
        }

        return (wavCat, durCat)
    }

    func batch(_ textList: [String], _ langList: [String], _ style: Style, _ totalStep: Int, speed: Float = 1.05) throws -> (wav: [Float], duration: [Float]) {
        return try _infer(textList, langList, style, totalStep, speed: speed)
    }
}

// MARK: - Component Loading Functions

func loadVoiceStyle(_ voiceStylePaths: [String], verbose: Bool) throws -> Style {
    let bsz = voiceStylePaths.count

    // Read first file to get dimensions
    let firstData = try Data(contentsOf: URL(fileURLWithPath: voiceStylePaths[0]))
    let firstStyle = try JSONDecoder().decode(VoiceStyleData.self, from: firstData)

    let ttlDims = firstStyle.style_ttl.dims
    let dpDims = firstStyle.style_dp.dims

    let ttlDim1 = ttlDims[1]
    let ttlDim2 = ttlDims[2]
    let dpDim1 = dpDims[1]
    let dpDim2 = dpDims[2]

    // Pre-allocate arrays with full batch size
    let ttlSize = bsz * ttlDim1 * ttlDim2
    let dpSize = bsz * dpDim1 * dpDim2
    var ttlFlat = [Float](repeating: 0.0, count: ttlSize)
    var dpFlat = [Float](repeating: 0.0, count: dpSize)

    // Fill in the data
    for (i, path) in voiceStylePaths.enumerated() {
        let data = try Data(contentsOf: URL(fileURLWithPath: path))
        let voiceStyle = try JSONDecoder().decode(VoiceStyleData.self, from: data)

        // Flatten TTL data
        let ttlOffset = i * ttlDim1 * ttlDim2
        var idx = 0
        for batch in voiceStyle.style_ttl.data {
            for row in batch {
                for val in row {
                    ttlFlat[ttlOffset + idx] = val
                    idx += 1
                }
            }
        }

        // Flatten DP data
        let dpOffset = i * dpDim1 * dpDim2
        idx = 0
        for batch in voiceStyle.style_dp.data {
            for row in batch {
                for val in row {
                    dpFlat[dpOffset + idx] = val
                    idx += 1
                }
            }
        }
    }

    let ttlShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: ttlDim1), NSNumber(value: ttlDim2)]
    let dpShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: dpDim1), NSNumber(value: dpDim2)]

    let ttlValue = try ORTValue(tensorData: NSMutableData(bytes: &ttlFlat, length: ttlFlat.count * MemoryLayout<Float>.size),
                                elementType: .float,
                                shape: ttlShape)
    let dpValue = try ORTValue(tensorData: NSMutableData(bytes: &dpFlat, length: dpFlat.count * MemoryLayout<Float>.size),
                               elementType: .float,
                               shape: dpShape)

    if verbose {
        print("Loaded \(bsz) voice styles\n")
    }

    return Style(ttl: ttlValue, dp: dpValue)
}

func loadTextToSpeech(_ onnxDir: String, _ useGpu: Bool, _ env: ORTEnv) throws -> TextToSpeech {
    if useGpu {
        throw NSError(domain: "TTS", code: 1, userInfo: [NSLocalizedDescriptionKey: "GPU mode is not supported yet"])
    }
    print("Using CPU for inference\n")

    let cfgs = try loadCfgs(onnxDir)

    let sessionOptions = try ORTSessionOptions()

    let dpPath = "\(onnxDir)/duration_predictor.onnx"
    let textEncPath = "\(onnxDir)/text_encoder.onnx"
    let vectorEstPath = "\(onnxDir)/vector_estimator.onnx"
    let vocoderPath = "\(onnxDir)/vocoder.onnx"

    let dpOrt = try ORTSession(env: env, modelPath: dpPath, sessionOptions: sessionOptions)
    let textEncOrt = try ORTSession(env: env, modelPath: textEncPath, sessionOptions: sessionOptions)
    let vectorEstOrt = try ORTSession(env: env, modelPath: vectorEstPath, sessionOptions: sessionOptions)
    let vocoderOrt = try ORTSession(env: env, modelPath: vocoderPath, sessionOptions: sessionOptions)

    let unicodeIndexerPath = "\(onnxDir)/unicode_indexer.json"
    let textProcessor = try UnicodeProcessor(unicodeIndexerPath: unicodeIndexerPath)

    return TextToSpeech(cfgs: cfgs, textProcessor: textProcessor,
                        dpOrt: dpOrt, textEncOrt: textEncOrt,
                        vectorEstOrt: vectorEstOrt, vocoderOrt: vocoderOrt)
}
330
test_all.sh
Normal file
@@ -0,0 +1,330 @@
#!/bin/bash

# Supertonic - Test All Language Implementations
# This script runs inference tests for all supported languages except web

set -e # Exit on error

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$SCRIPT_DIR"

echo "=================================="
echo "Supertonic - Testing All Examples"
echo "=================================="
echo ""

# Ask user to select test mode
echo "Select test mode:"
echo " 1) Default inference only"
echo " 2) Batch inference only"
echo " 3) Long-form inference only"
echo " 4) All tests (default + batch + long-form)"
echo -e "Enter your choice (1/2/3/4) [default: 1]: \c"
read -r test_mode
test_mode=${test_mode:-1}

case $test_mode in
    1)
        TEST_DEFAULT=true
        TEST_BATCH=false
        TEST_LONGFORM=false
        echo "Running default inference tests only"
        ;;
    2)
        TEST_DEFAULT=false
        TEST_BATCH=true
        TEST_LONGFORM=false
        echo "Running batch inference tests only"
        ;;
    3)
        TEST_DEFAULT=false
        TEST_BATCH=false
        TEST_LONGFORM=true
        echo "Running long-form inference tests only"
        ;;
    4)
        TEST_DEFAULT=true
        TEST_BATCH=true
        TEST_LONGFORM=true
        echo "Running all tests (default + batch + long-form)"
        ;;
    *)
        echo "Invalid choice. Using default inference only."
        TEST_DEFAULT=true
        TEST_BATCH=false
        TEST_LONGFORM=false
        ;;
esac
echo ""

# Batch inference test data - multilingual examples
BATCH_VOICE_STYLE_1="assets/voice_styles/M1.json"
BATCH_VOICE_STYLE_2="assets/voice_styles/F1.json"
BATCH_TEXT_1="The sun sets behind the mountains, painting the sky in shades of pink and orange."
BATCH_TEXT_2="오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요."
BATCH_LANG_1="en"
BATCH_LANG_2="ko"

# Long-form inference test data
LONGFORM_VOICE_STYLE="assets/voice_styles/M1.json"
LONGFORM_TEXT="This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues. The text chunking algorithm intelligently splits on paragraph and sentence boundaries, preserving the natural flow of the content. When a sentence is too long, it further splits on commas and spaces as needed. This multi-level approach ensures optimal chunk sizes for inference while maintaining linguistic coherence."

# Ask if user wants to clean results folders
echo -e "Do you want to clean all results folders before running tests? (y/N): \c"
read -r response
if [[ "$response" =~ ^[Yy]$ ]]; then
    echo ""
    echo "Cleaning results folders..."

    # List of result directories
    declare -a RESULT_DIRS=(
        "py/results"
        "nodejs/results"
        "go/results"
        "rust/results"
        "csharp/results"
        "java/results"
        "swift/results"
        "cpp/build/results"
    )

    for dir in "${RESULT_DIRS[@]}"; do
        if [ -d "$SCRIPT_DIR/$dir" ]; then
            echo " - Cleaning $dir"
            rm -rf "$SCRIPT_DIR/$dir"/*
        fi
    done

    echo "Results folders cleaned!"
    echo ""
fi

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Track results
declare -a PASSED=()
declare -a FAILED=()

# Helper function to show statistics
show_stats() {
    local name=$1
    local results_dir=$2

    if [ -d "$results_dir" ]; then
        # Count .wav files
        local file_count=$(find "$results_dir" -name "*.wav" -type f 2>/dev/null | wc -l | tr -d ' ')

        if [ "$file_count" -gt 0 ]; then
            # Calculate total size
            local total_size=0
            while IFS= read -r file; do
                if [ -f "$file" ]; then
                    local size=$(stat -f%z "$file" 2>/dev/null || stat -c%s "$file" 2>/dev/null)
                    total_size=$((total_size + size))
                fi
            done < <(find "$results_dir" -name "*.wav" -type f 2>/dev/null)

            # Calculate statistics
            local total_size_mb=$(echo "scale=2; $total_size / 1024 / 1024" | bc)
            local avg_size_kb=$(echo "scale=2; $total_size / $file_count / 1024" | bc)

            echo -e "${BLUE}[$name]${NC} 📊 Statistics:"
            echo -e "${BLUE}[$name]${NC}   - Files generated: $file_count"
            echo -e "${BLUE}[$name]${NC}   - Total size: ${total_size_mb} MB"
            echo -e "${BLUE}[$name]${NC}   - Average file size: ${avg_size_kb} KB"
        fi
    fi
}

# Helper function to run tests
run_test() {
    local name=$1
    local dir=$2
    shift 2
    local cmd="$@"

    echo -e "${BLUE}[$name]${NC} Running inference..."
    cd "$SCRIPT_DIR/$dir"

    # Determine results directory based on the directory
    local results_dir="$SCRIPT_DIR/$dir/results"
    if [[ "$dir" == "cpp/build" ]]; then
        results_dir="$SCRIPT_DIR/cpp/build/results"
    fi

    # Run command and prefix each output line with the language name
    if eval "$cmd" 2>&1 | sed "s/^/[$name] /"; then
        echo -e "${GREEN}[$name]${NC} ✓ Success"

        # Show statistics
        show_stats "$name" "$results_dir"

        PASSED+=("$name")
    else
        echo -e "${RED}[$name]${NC} ✗ Failed"
        FAILED+=("$name")
    fi
    echo ""
    cd "$SCRIPT_DIR"
}

# ====================================
# Python
# ====================================
echo -e "${YELLOW}Testing Python...${NC}"
if [ "$TEST_DEFAULT" = true ]; then
    run_test "Python (default)" "py" "uv run example_onnx.py"
fi
if [ "$TEST_BATCH" = true ]; then
    run_test "Python (batch)" "py" "uv run example_onnx.py --batch --voice-style $BATCH_VOICE_STYLE_1 $BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1' '$BATCH_TEXT_2' --lang $BATCH_LANG_1 $BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
    run_test "Python (long-form)" "py" "uv run example_onnx.py --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi

# ====================================
# JavaScript (Node.js)
# ====================================
echo -e "${YELLOW}Testing JavaScript (Node.js)...${NC}"
echo "Installing Node.js dependencies..."
cd nodejs && npm install --silent && cd ..
if [ "$TEST_DEFAULT" = true ]; then
    run_test "JavaScript (default)" "nodejs" "node example_onnx.js"
fi
if [ "$TEST_BATCH" = true ]; then
    run_test "JavaScript (batch)" "nodejs" "node example_onnx.js --batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
    run_test "JavaScript (long-form)" "nodejs" "node example_onnx.js --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi

# ====================================
# Go
# ====================================
echo -e "${YELLOW}Testing Go...${NC}"
echo "Cleaning Go cache..."
cd go && go clean && cd ..
export ONNXRUNTIME_LIB_PATH=$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib
if [ "$TEST_DEFAULT" = true ]; then
    run_test "Go (default)" "go" "go run example_onnx.go helper.go"
fi
if [ "$TEST_BATCH" = true ]; then
    run_test "Go (batch)" "go" "go run example_onnx.go helper.go --batch -voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 -text '$BATCH_TEXT_1|$BATCH_TEXT_2' -lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
    run_test "Go (long-form)" "go" "go run example_onnx.go helper.go -voice-style $LONGFORM_VOICE_STYLE -text '$LONGFORM_TEXT'"
fi

# ====================================
# Rust
# ====================================
echo -e "${YELLOW}Testing Rust...${NC}"
echo "Building Rust project..."
cd rust && cargo clean && cd ..
if [ "$TEST_DEFAULT" = true ]; then
    run_test "Rust (default)" "rust" "cargo run --release"
fi
if [ "$TEST_BATCH" = true ]; then
    run_test "Rust (batch)" "rust" "cargo run --release -- --batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
    run_test "Rust (long-form)" "rust" "cargo run --release -- --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi

# ====================================
# C#
# ====================================
echo -e "${YELLOW}Testing C#...${NC}"
echo "Building C# project..."
cd csharp && dotnet clean && cd ..
if [ "$TEST_DEFAULT" = true ]; then
    run_test "C# (default)" "csharp" "dotnet run --configuration Release"
fi
if [ "$TEST_BATCH" = true ]; then
    run_test "C# (batch)" "csharp" "dotnet run --configuration Release -- --batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
    run_test "C# (long-form)" "csharp" "dotnet run --configuration Release -- --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi

# ====================================
# Java
# ====================================
echo -e "${YELLOW}Testing Java...${NC}"
echo "Building Java project..."
cd java && mvn clean install -q && cd ..
if [ "$TEST_DEFAULT" = true ]; then
    run_test "Java (default)" "java" "mvn exec:java -q"
fi
if [ "$TEST_BATCH" = true ]; then
    run_test "Java (batch)" "java" "mvn exec:java -q -Dexec.args='--batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text \"$BATCH_TEXT_1|$BATCH_TEXT_2\" --lang $BATCH_LANG_1,$BATCH_LANG_2'"
fi
if [ "$TEST_LONGFORM" = true ]; then
    run_test "Java (long-form)" "java" "mvn exec:java -q -Dexec.args='--voice-style $LONGFORM_VOICE_STYLE --text \"$LONGFORM_TEXT\"'"
fi

# ====================================
# Swift
# ====================================
echo -e "${YELLOW}Testing Swift...${NC}"
echo "Building Swift project..."
cd swift && swift build -c release && cd ..
if [ "$TEST_DEFAULT" = true ]; then
    run_test "Swift (default)" "swift" ".build/release/example_onnx"
fi
if [ "$TEST_BATCH" = true ]; then
    run_test "Swift (batch)" "swift" ".build/release/example_onnx --batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
    run_test "Swift (long-form)" "swift" ".build/release/example_onnx --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi

# ====================================
# C++
# ====================================
echo -e "${YELLOW}Testing C++...${NC}"
echo "Building C++ project..."
cd cpp && mkdir -p build && cd build && cmake .. && make && cd ../..
if [ "$TEST_DEFAULT" = true ]; then
    run_test "C++ (default)" "cpp/build" "./example_onnx"
|
||||
fi
|
||||
if [ "$TEST_BATCH" = true ]; then
|
||||
run_test "C++ (batch)" "cpp/build" "./example_onnx --batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
||||
fi
|
||||
if [ "$TEST_LONGFORM" = true ]; then
|
||||
run_test "C++ (long-form)" "cpp/build" "./example_onnx --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
||||
fi
|
||||
|
||||
# ====================================
|
||||
# Summary
|
||||
# ====================================
|
||||
echo "=================================="
|
||||
echo "Test Summary"
|
||||
echo "=================================="
|
||||
echo ""
|
||||
|
||||
if [ ${#PASSED[@]} -gt 0 ]; then
|
||||
echo -e "${GREEN}Passed (${#PASSED[@]}):${NC}"
|
||||
for lang in "${PASSED[@]}"; do
|
||||
echo -e " ${GREEN}✓${NC} $lang"
|
||||
done
|
||||
echo ""
|
||||
fi
|
||||
|
||||
if [ ${#FAILED[@]} -gt 0 ]; then
|
||||
echo -e "${RED}Failed (${#FAILED[@]}):${NC}"
|
||||
for lang in "${FAILED[@]}"; do
|
||||
echo -e " ${RED}✗${NC} $lang"
|
||||
done
|
||||
echo ""
|
||||
exit 1
|
||||
else
|
||||
echo -e "${GREEN}All tests passed! 🎉${NC}"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
4
web/.gitignore
vendored
Normal file
@@ -0,0 +1,4 @@
node_modules/
dist/
.DS_Store
*.log
121
web/README.md
Normal file
@@ -0,0 +1,121 @@
# Supertonic Web Example

This example demonstrates how to use Supertonic in a web browser using ONNX Runtime Web.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) are now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic).

**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.

**2025.11.19** - Added a speed control slider to adjust speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).

**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.

## Features

- 🌐 Runs entirely in the browser (no server required for inference)
- 🚀 WebGPU support with automatic fallback to WebAssembly
- 🌍 Multilingual support: English (en), Korean (ko), Spanish (es), Portuguese (pt), French (fr)
- ⚡ Pre-extracted voice styles for instant generation
- 🎨 Modern, responsive UI
- 🎭 Multiple voice style presets (5 Male, 5 Female)
- 💾 Download generated audio as WAV files
- 📊 Detailed generation statistics (audio length, generation time)
- ⏱️ Real-time progress tracking

## Requirements

- Node.js (for the development server)
- Modern web browser (Chrome, Edge, Firefox, Safari)

## Installation

1. Install dependencies:

```bash
npm install
```

## Running the Demo

Start the development server:

```bash
npm run dev
```

This will start a local development server (usually at http://localhost:3000) and open the demo in your browser.

## Usage

1. **Wait for Models to Load**: The app will automatically load models and the default voice style (M1)
2. **Select Voice Style**: Choose from available voice presets
   - **Male 1-5 (M1-M5)**: Male voice styles
   - **Female 1-5 (F1-F5)**: Female voice styles
3. **Select Language**: Choose the language that matches your input text
   - **English (en)**: Default language
   - **한국어 (ko)**: Korean
   - **Español (es)**: Spanish
   - **Português (pt)**: Portuguese
   - **Français (fr)**: French
4. **Enter Text**: Type or paste the text you want to convert to speech
5. **Adjust Settings** (optional):
   - **Total Steps**: More steps = better quality but slower (default: 5)
   - **Speed**: Speech synthesis speed (default: 1.05, recommended range: 0.9-1.5)
6. **Generate Speech**: Click the "Generate Speech" button
7. **View Results**:
   - See the full input text
   - View audio length and generation time statistics
   - Play the generated audio in the browser
   - Download as WAV file
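The WAV file offered for download is assembled client-side by `writeWavFile()` in `helper.js`. As an illustration of what that function emits, here is a minimal sketch of the fixed 44-byte RIFF/PCM header for 16-bit mono audio (`wavHeader` is an illustrative name, not part of the demo):

```javascript
// Sketch of the 44-byte WAV header layout used by writeWavFile() in helper.js.
// One 16-bit mono sample contributes 2 bytes of payload.
function wavHeader(numSamples, sampleRate) {
    const dataSize = numSamples * 2;
    const buf = new ArrayBuffer(44);
    const view = new DataView(buf);
    const ascii = (off, s) => [...s].forEach((c, i) => view.setUint8(off + i, c.charCodeAt(0)));

    ascii(0, 'RIFF');
    view.setUint32(4, 36 + dataSize, true);   // RIFF chunk size
    ascii(8, 'WAVE');
    ascii(12, 'fmt ');
    view.setUint32(16, 16, true);             // fmt chunk size
    view.setUint16(20, 1, true);              // audio format: PCM
    view.setUint16(22, 1, true);              // channels: mono
    view.setUint32(24, sampleRate, true);     // sample rate
    view.setUint32(28, sampleRate * 2, true); // byte rate
    view.setUint16(32, 2, true);              // block align
    view.setUint16(34, 16, true);             // bits per sample
    ascii(36, 'data');
    view.setUint32(40, dataSize, true);       // payload size in bytes
    return buf;
}
```

For example, two seconds of 44.1 kHz mono audio gives a `dataSize` of 176400 bytes after the fixed 44-byte header.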

## Multilingual Support

Supertonic 2 supports multiple languages. Make sure to select the correct language for your input text to get the best results. The model will automatically handle text preprocessing and pronunciation for the selected language.
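When driving the pipeline programmatically, the language code can be validated up front. This sketch simply mirrors the `AVAILABLE_LANGS` and `isValidLang` helpers exported by `helper.js`:

```javascript
// Mirrors the exports in web/helper.js: unsupported codes are rejected
// before any model inference runs.
const AVAILABLE_LANGS = ['en', 'ko', 'es', 'pt', 'fr'];
const isValidLang = (lang) => AVAILABLE_LANGS.includes(lang);

console.log(isValidLang('ko')); // true
console.log(isValidLang('ja')); // false: Japanese is not supported
```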

## Technical Details

### Browser Compatibility

This demo uses:

- **ONNX Runtime Web**: For running models in the browser
- **Web Audio API**: For playing generated audio
- **Vite**: For development and bundling
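The WebGPU-then-WebAssembly preference mentioned above can be expressed through ONNX Runtime Web's session options; the provider list below is a sketch of that idea (the demo's actual session setup lives in `helper.js`/`main.js`):

```javascript
// Listing 'webgpu' before 'wasm' asks ONNX Runtime Web to try WebGPU
// first and fall back to WebAssembly where it is unavailable.
const sessionOptions = { executionProviders: ['webgpu', 'wasm'] };

// In the browser (inside an async module):
//   import * as ort from 'onnxruntime-web';
//   const session = await ort.InferenceSession.create('assets/onnx/vocoder.onnx', sessionOptions);
console.log(sessionOptions.executionProviders.join(' -> ')); // webgpu -> wasm
```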

## Notes

- The ONNX models must be accessible at `assets/onnx/` relative to the web root
- Voice style JSON files must be accessible at `assets/voice_styles/` relative to the web root
- Pre-extracted voice styles enable instant generation without audio processing
- Ten voice style presets are provided (M1-M5, F1-F5)

## Troubleshooting

### Models not loading

- Check the browser console for errors
- Ensure the `assets/onnx/` path is correct and the models are accessible
- Check CORS settings if serving from a different domain

### WebGPU not available

- WebGPU is only available in recent Chrome/Edge browsers (version 113+)
- The app will automatically fall back to WebAssembly if WebGPU is not available
- Check the backend badge to see which execution provider is being used

### Out of memory errors

- Try shorter text inputs
- Reduce the number of denoising steps
- Use a browser with more available memory
- Close other tabs to free up memory

### Audio quality issues

- Try different voice style presets
- Increase the number of denoising steps for better quality

### Slow generation

- If using WebAssembly, try a browser that supports WebGPU
- Ensure no other heavy processes are running
- Use fewer denoising steps for faster (but lower-quality) results
561
web/helper.js
Normal file
@@ -0,0 +1,561 @@
import * as ort from 'onnxruntime-web';

// Available languages for multilingual TTS
export const AVAILABLE_LANGS = ['en', 'ko', 'es', 'pt', 'fr'];

export function isValidLang(lang) {
    return AVAILABLE_LANGS.includes(lang);
}

/**
 * Unicode Text Processor
 */
export class UnicodeProcessor {
    constructor(indexer) {
        this.indexer = indexer;
    }

    call(textList, langList) {
        const processedTexts = textList.map((text, i) => this.preprocessText(text, langList[i]));

        const textIdsLengths = processedTexts.map(text => text.length);
        const maxLen = Math.max(...textIdsLengths);

        const textIds = processedTexts.map(text => {
            const row = new Array(maxLen).fill(0);
            for (let j = 0; j < text.length; j++) {
                const codePoint = text.codePointAt(j);
                row[j] = (codePoint < this.indexer.length) ? this.indexer[codePoint] : -1;
            }
            return row;
        });

        const textMask = this.getTextMask(textIdsLengths);
        return { textIds, textMask };
    }

    preprocessText(text, lang) {
        // TODO: Need advanced normalizer for better performance
        text = text.normalize('NFKD');

        // Remove emojis (wide Unicode range)
        const emojiPattern = /[\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F680}-\u{1F6FF}\u{1F700}-\u{1F77F}\u{1F780}-\u{1F7FF}\u{1F800}-\u{1F8FF}\u{1F900}-\u{1F9FF}\u{1FA00}-\u{1FA6F}\u{1FA70}-\u{1FAFF}\u{2600}-\u{26FF}\u{2700}-\u{27BF}\u{1F1E6}-\u{1F1FF}]+/gu;
        text = text.replace(emojiPattern, '');

        // Replace various dashes and symbols
        const replacements = {
            '–': '-',
            '‑': '-',
            '—': '-',
            '_': ' ',
            '\u201C': '"', // left double quote "
            '\u201D': '"', // right double quote "
            '\u2018': "'", // left single quote '
            '\u2019': "'", // right single quote '
            '´': "'",
            '`': "'",
            '[': ' ',
            ']': ' ',
            '|': ' ',
            '/': ' ',
            '#': ' ',
            '→': ' ',
            '←': ' ',
        };
        for (const [k, v] of Object.entries(replacements)) {
            text = text.replaceAll(k, v);
        }

        // Remove special symbols
        text = text.replace(/[♥☆♡©\\]/g, '');

        // Replace known expressions
        const exprReplacements = {
            '@': ' at ',
            'e.g.,': 'for example, ',
            'i.e.,': 'that is, ',
        };
        for (const [k, v] of Object.entries(exprReplacements)) {
            text = text.replaceAll(k, v);
        }

        // Fix spacing around punctuation
        text = text.replace(/ ,/g, ',');
        text = text.replace(/ \./g, '.');
        text = text.replace(/ !/g, '!');
        text = text.replace(/ \?/g, '?');
        text = text.replace(/ ;/g, ';');
        text = text.replace(/ :/g, ':');
        text = text.replace(/ '/g, "'");

        // Remove duplicate quotes
        while (text.includes('""')) {
            text = text.replace('""', '"');
        }
        while (text.includes("''")) {
            text = text.replace("''", "'");
        }
        while (text.includes('``')) {
            text = text.replace('``', '`');
        }

        // Remove extra spaces
        text = text.replace(/\s+/g, ' ').trim();

        // If text doesn't end with punctuation, quotes, or closing brackets, add a period
        if (!/[.!?;:,'\"')\]}…。」』】〉》›»]$/.test(text)) {
            text += '.';
        }

        // Validate language
        if (!isValidLang(lang)) {
            throw new Error(`Invalid language: ${lang}. Available: ${AVAILABLE_LANGS.join(', ')}`);
        }

        // Wrap text with language tags
        text = `<${lang}>${text}</${lang}>`;

        return text;
    }

    getTextMask(textIdsLengths) {
        const maxLen = Math.max(...textIdsLengths);
        return this.lengthToMask(textIdsLengths, maxLen);
    }

    lengthToMask(lengths, maxLen = null) {
        const actualMaxLen = maxLen || Math.max(...lengths);
        return lengths.map(len => {
            const row = new Array(actualMaxLen).fill(0.0);
            for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
                row[j] = 1.0;
            }
            return [row];
        });
    }
}

/**
 * Style class to hold TTL and DP tensors
 */
export class Style {
    constructor(ttlTensor, dpTensor) {
        this.ttl = ttlTensor;
        this.dp = dpTensor;
    }
}

/**
 * Text-to-Speech class
 */
export class TextToSpeech {
    constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
        this.cfgs = cfgs;
        this.textProcessor = textProcessor;
        this.dpOrt = dpOrt;
        this.textEncOrt = textEncOrt;
        this.vectorEstOrt = vectorEstOrt;
        this.vocoderOrt = vocoderOrt;
        this.sampleRate = cfgs.ae.sample_rate;
    }

    async _infer(textList, langList, style, totalStep, speed = 1.05, progressCallback = null) {
        const bsz = textList.length;

        // Process text
        const { textIds, textMask } = this.textProcessor.call(textList, langList);

        const textIdsFlat = new BigInt64Array(textIds.flat().map(x => BigInt(x)));
        const textIdsShape = [bsz, textIds[0].length];
        const textIdsTensor = new ort.Tensor('int64', textIdsFlat, textIdsShape);

        const textMaskFlat = new Float32Array(textMask.flat(2));
        const textMaskShape = [bsz, 1, textMask[0][0].length];
        const textMaskTensor = new ort.Tensor('float32', textMaskFlat, textMaskShape);

        // Predict duration
        const dpOutputs = await this.dpOrt.run({
            text_ids: textIdsTensor,
            style_dp: style.dp,
            text_mask: textMaskTensor
        });
        const duration = Array.from(dpOutputs.duration.data);

        // Apply speed factor to duration
        for (let i = 0; i < duration.length; i++) {
            duration[i] /= speed;
        }

        // Encode text
        const textEncOutputs = await this.textEncOrt.run({
            text_ids: textIdsTensor,
            style_ttl: style.ttl,
            text_mask: textMaskTensor
        });
        const textEmb = textEncOutputs.text_emb;

        // Sample noisy latent
        let { xt, latentMask } = this.sampleNoisyLatent(
            duration,
            this.sampleRate,
            this.cfgs.ae.base_chunk_size,
            this.cfgs.ttl.chunk_compress_factor,
            this.cfgs.ttl.latent_dim
        );

        const latentMaskFlat = new Float32Array(latentMask.flat(2));
        const latentMaskShape = [bsz, 1, latentMask[0][0].length];
        const latentMaskTensor = new ort.Tensor('float32', latentMaskFlat, latentMaskShape);

        // Prepare constant arrays
        const totalStepArray = new Float32Array(bsz).fill(totalStep);
        const totalStepTensor = new ort.Tensor('float32', totalStepArray, [bsz]);

        // Denoising loop
        for (let step = 0; step < totalStep; step++) {
            if (progressCallback) {
                progressCallback(step + 1, totalStep);
            }

            const currentStepArray = new Float32Array(bsz).fill(step);
            const currentStepTensor = new ort.Tensor('float32', currentStepArray, [bsz]);

            const xtFlat = new Float32Array(xt.flat(2));
            const xtShape = [bsz, xt[0].length, xt[0][0].length];
            const xtTensor = new ort.Tensor('float32', xtFlat, xtShape);

            const vectorEstOutputs = await this.vectorEstOrt.run({
                noisy_latent: xtTensor,
                text_emb: textEmb,
                style_ttl: style.ttl,
                latent_mask: latentMaskTensor,
                text_mask: textMaskTensor,
                current_step: currentStepTensor,
                total_step: totalStepTensor
            });

            const denoised = Array.from(vectorEstOutputs.denoised_latent.data);

            // Reshape to 3D
            const latentDim = xt[0].length;
            const latentLen = xt[0][0].length;
            xt = [];
            let idx = 0;
            for (let b = 0; b < bsz; b++) {
                const batch = [];
                for (let d = 0; d < latentDim; d++) {
                    const row = [];
                    for (let t = 0; t < latentLen; t++) {
                        row.push(denoised[idx++]);
                    }
                    batch.push(row);
                }
                xt.push(batch);
            }
        }

        // Generate waveform
        const finalXtFlat = new Float32Array(xt.flat(2));
        const finalXtShape = [bsz, xt[0].length, xt[0][0].length];
        const finalXtTensor = new ort.Tensor('float32', finalXtFlat, finalXtShape);

        const vocoderOutputs = await this.vocoderOrt.run({
            latent: finalXtTensor
        });

        const wav = Array.from(vocoderOutputs.wav_tts.data);

        return { wav, duration };
    }

    async call(text, lang, style, totalStep, speed = 1.05, silenceDuration = 0.3, progressCallback = null) {
        if (style.ttl.dims[0] !== 1) {
            throw new Error('Single speaker text to speech only supports single style');
        }
        const maxLen = lang === 'ko' ? 120 : 300;
        const textList = chunkText(text, maxLen);
        const langList = new Array(textList.length).fill(lang);
        let wavCat = [];
        let durCat = 0;

        for (let i = 0; i < textList.length; i++) {
            const { wav, duration } = await this._infer([textList[i]], [langList[i]], style, totalStep, speed, progressCallback);

            if (wavCat.length === 0) {
                wavCat = wav;
                durCat = duration[0];
            } else {
                const silenceLen = Math.floor(silenceDuration * this.sampleRate);
                const silence = new Array(silenceLen).fill(0);
                wavCat = [...wavCat, ...silence, ...wav];
                durCat += duration[0] + silenceDuration;
            }
        }

        return { wav: wavCat, duration: [durCat] };
    }

    async batch(textList, langList, style, totalStep, speed = 1.05, progressCallback = null) {
        return await this._infer(textList, langList, style, totalStep, speed, progressCallback);
    }

    sampleNoisyLatent(duration, sampleRate, baseChunkSize, chunkCompress, latentDim) {
        const bsz = duration.length;
        const maxDur = Math.max(...duration);

        const wavLenMax = Math.floor(maxDur * sampleRate);
        const wavLengths = duration.map(d => Math.floor(d * sampleRate));

        const chunkSize = baseChunkSize * chunkCompress;
        const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
        const latentDimVal = latentDim * chunkCompress;

        const xt = [];
        for (let b = 0; b < bsz; b++) {
            const batch = [];
            for (let d = 0; d < latentDimVal; d++) {
                const row = [];
                for (let t = 0; t < latentLen; t++) {
                    // Box-Muller transform
                    const u1 = Math.max(0.0001, Math.random());
                    const u2 = Math.random();
                    const val = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
                    row.push(val);
                }
                batch.push(row);
            }
            xt.push(batch);
        }

        const latentLengths = wavLengths.map(len => Math.floor((len + chunkSize - 1) / chunkSize));
        const latentMask = this.lengthToMask(latentLengths, latentLen);

        // Apply mask
        for (let b = 0; b < bsz; b++) {
            for (let d = 0; d < latentDimVal; d++) {
                for (let t = 0; t < latentLen; t++) {
                    xt[b][d][t] *= latentMask[b][0][t];
                }
            }
        }

        return { xt, latentMask };
    }

    lengthToMask(lengths, maxLen = null) {
        const actualMaxLen = maxLen || Math.max(...lengths);
        return lengths.map(len => {
            const row = new Array(actualMaxLen).fill(0.0);
            for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
                row[j] = 1.0;
            }
            return [row];
        });
    }
}

/**
 * Load voice style from JSON files
 */
export async function loadVoiceStyle(voiceStylePaths, verbose = false) {
    const bsz = voiceStylePaths.length;

    // Read first file to get dimensions
    const firstResponse = await fetch(voiceStylePaths[0]);
    const firstStyle = await firstResponse.json();

    const ttlDims = firstStyle.style_ttl.dims;
    const dpDims = firstStyle.style_dp.dims;

    const ttlDim1 = ttlDims[1];
    const ttlDim2 = ttlDims[2];
    const dpDim1 = dpDims[1];
    const dpDim2 = dpDims[2];

    // Pre-allocate arrays with full batch size
    const ttlSize = bsz * ttlDim1 * ttlDim2;
    const dpSize = bsz * dpDim1 * dpDim2;
    const ttlFlat = new Float32Array(ttlSize);
    const dpFlat = new Float32Array(dpSize);

    // Fill in the data
    for (let i = 0; i < bsz; i++) {
        const response = await fetch(voiceStylePaths[i]);
        const voiceStyle = await response.json();

        // Flatten TTL data
        const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
        const ttlOffset = i * ttlDim1 * ttlDim2;
        ttlFlat.set(ttlData, ttlOffset);

        // Flatten DP data
        const dpData = voiceStyle.style_dp.data.flat(Infinity);
        const dpOffset = i * dpDim1 * dpDim2;
        dpFlat.set(dpData, dpOffset);
    }

    const ttlShape = [bsz, ttlDim1, ttlDim2];
    const dpShape = [bsz, dpDim1, dpDim2];

    const ttlTensor = new ort.Tensor('float32', ttlFlat, ttlShape);
    const dpTensor = new ort.Tensor('float32', dpFlat, dpShape);

    if (verbose) {
        console.log(`Loaded ${bsz} voice styles`);
    }

    return new Style(ttlTensor, dpTensor);
}

/**
 * Load configuration from JSON
 */
export async function loadCfgs(onnxDir) {
    const response = await fetch(`${onnxDir}/tts.json`);
    const cfgs = await response.json();
    return cfgs;
}

/**
 * Load text processor
 */
export async function loadTextProcessor(onnxDir) {
    const response = await fetch(`${onnxDir}/unicode_indexer.json`);
    const indexer = await response.json();
    return new UnicodeProcessor(indexer);
}

/**
 * Load ONNX model
 */
export async function loadOnnx(onnxPath, options) {
    const session = await ort.InferenceSession.create(onnxPath, options);
    return session;
}

/**
 * Load all TTS components
 */
export async function loadTextToSpeech(onnxDir, sessionOptions = {}, progressCallback = null) {
    console.log('Using WebAssembly/WebGPU for inference');

    const cfgs = await loadCfgs(onnxDir);

    const dpPath = `${onnxDir}/duration_predictor.onnx`;
    const textEncPath = `${onnxDir}/text_encoder.onnx`;
    const vectorEstPath = `${onnxDir}/vector_estimator.onnx`;
    const vocoderPath = `${onnxDir}/vocoder.onnx`;

    const modelPaths = [
        { name: 'Duration Predictor', path: dpPath },
        { name: 'Text Encoder', path: textEncPath },
        { name: 'Vector Estimator', path: vectorEstPath },
        { name: 'Vocoder', path: vocoderPath }
    ];

    const sessions = [];
    for (let i = 0; i < modelPaths.length; i++) {
        if (progressCallback) {
            progressCallback(modelPaths[i].name, i + 1, modelPaths.length);
        }
        const session = await loadOnnx(modelPaths[i].path, sessionOptions);
        sessions.push(session);
    }

    const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = sessions;

    const textProcessor = await loadTextProcessor(onnxDir);
    const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);

    return { textToSpeech, cfgs };
}

/**
 * Chunk text into manageable segments
 */
function chunkText(text, maxLen = 300) {
    if (typeof text !== 'string') {
        throw new Error(`chunkText expects a string, got ${typeof text}`);
    }

    // Split by paragraph (two or more newlines)
    const paragraphs = text.trim().split(/\n\s*\n+/).filter(p => p.trim());

    const chunks = [];

    for (let paragraph of paragraphs) {
        paragraph = paragraph.trim();
        if (!paragraph) continue;

        // Split by sentence boundaries (period, question mark, exclamation mark followed by space)
        // But exclude common abbreviations like Mr., Mrs., Dr., etc. and single capital letters like F.
        const sentences = paragraph.split(/(?<!Mr\.|Mrs\.|Ms\.|Dr\.|Prof\.|Sr\.|Jr\.|Ph\.D\.|etc\.|e\.g\.|i\.e\.|vs\.|Inc\.|Ltd\.|Co\.|Corp\.|St\.|Ave\.|Blvd\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+/);

        let currentChunk = "";

        for (let sentence of sentences) {
            if (currentChunk.length + sentence.length + 1 <= maxLen) {
                currentChunk += (currentChunk ? " " : "") + sentence;
            } else {
                if (currentChunk) {
                    chunks.push(currentChunk.trim());
                }
                currentChunk = sentence;
            }
        }

        if (currentChunk) {
            chunks.push(currentChunk.trim());
        }
    }

    return chunks;
}

/**
 * Write WAV file to ArrayBuffer
 */
export function writeWavFile(audioData, sampleRate) {
    const numChannels = 1;
    const bitsPerSample = 16;
    const byteRate = sampleRate * numChannels * bitsPerSample / 8;
    const blockAlign = numChannels * bitsPerSample / 8;
    const dataSize = audioData.length * 2;

    // Create ArrayBuffer
    const buffer = new ArrayBuffer(44 + dataSize);
    const view = new DataView(buffer);

    // Write WAV header
    const writeString = (offset, string) => {
        for (let i = 0; i < string.length; i++) {
            view.setUint8(offset + i, string.charCodeAt(i));
        }
    };

    writeString(0, 'RIFF');
    view.setUint32(4, 36 + dataSize, true);
    writeString(8, 'WAVE');
    writeString(12, 'fmt ');
    view.setUint32(16, 16, true);
    view.setUint16(20, 1, true); // PCM
    view.setUint16(22, numChannels, true);
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, byteRate, true);
    view.setUint16(32, blockAlign, true);
    view.setUint16(34, bitsPerSample, true);
    writeString(36, 'data');
    view.setUint32(40, dataSize, true);

    // Write audio data
    const int16Data = new Int16Array(audioData.length);
    for (let i = 0; i < audioData.length; i++) {
        const clamped = Math.max(-1.0, Math.min(1.0, audioData[i]));
        int16Data[i] = Math.floor(clamped * 32767);
    }

    const dataView = new Uint8Array(buffer, 44);
    dataView.set(new Uint8Array(int16Data.buffer));

    return buffer;
}
95
web/index.html
Normal file
@@ -0,0 +1,95 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Supertonic - Web Demo</title>
    <link rel="stylesheet" href="/style.css">
</head>
<body>
    <div class="container">
        <h1>🎤 Supertonic 2</h1>
        <p class="subtitle">Multilingual Text-to-Speech with ONNX Runtime Web</p>

        <div id="statusBox" class="status-box">
            <div class="status-text-wrapper">
                <div id="statusText">ℹ️ <strong>Loading models...</strong> Please wait...</div>
            </div>
            <div id="backendBadge" class="backend-badge">WebAssembly</div>
        </div>

        <div class="main-content">
            <div class="left-panel">
                <div class="section">
                    <div class="ref-audio-label">
                        <label for="voiceStyleSelect">Voice Style: </label>
                        <span id="voiceStyleInfo" class="ref-audio-info">Loading...</span>
                    </div>
                    <select id="voiceStyleSelect">
                        <option value="assets/voice_styles/M1.json">Male 1 (M1)</option>
                        <option value="assets/voice_styles/M2.json">Male 2 (M2)</option>
                        <option value="assets/voice_styles/M3.json">Male 3 (M3)</option>
                        <option value="assets/voice_styles/M4.json">Male 4 (M4)</option>
                        <option value="assets/voice_styles/M5.json">Male 5 (M5)</option>
                        <option value="assets/voice_styles/F1.json">Female 1 (F1)</option>
                        <option value="assets/voice_styles/F2.json">Female 2 (F2)</option>
                        <option value="assets/voice_styles/F3.json">Female 3 (F3)</option>
                        <option value="assets/voice_styles/F4.json">Female 4 (F4)</option>
                        <option value="assets/voice_styles/F5.json">Female 5 (F5)</option>
                    </select>
                </div>

                <div class="section">
                    <label for="langSelect">Language:</label>
                    <select id="langSelect">
                        <option value="en" selected>English (en)</option>
                        <option value="ko">한국어 (ko)</option>
                        <option value="es">Español (es)</option>
                        <option value="pt">Português (pt)</option>
                        <option value="fr">Français (fr)</option>
                    </select>
                </div>

                <div class="section">
                    <label for="text">Text to Synthesize:</label>
                    <textarea id="text" placeholder="Enter the text you want to convert to speech...">This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.</textarea>
                </div>

                <div class="params-grid">
                    <div class="section">
                        <label for="totalStep">Total Steps (higher = better quality):</label>
                        <input type="number" id="totalStep" value="5" min="1" max="50">
                    </div>

                    <div class="section">
                        <label for="speed">Speed (0.9-1.5 recommended):</label>
                        <input type="number" id="speed" value="1.05" min="0.5" max="2.0" step="0.05">
                    </div>
                </div>

                <button id="generateBtn">Generate Speech</button>

                <div id="error" class="error"></div>
            </div>

            <div class="right-panel">
                <div id="results" class="results">
                    <div class="results-placeholder">
                        <div class="results-placeholder-icon">🎤</div>
                        <p>Generated speech will appear here</p>
                    </div>
                </div>
            </div>
        </div>
    </div>

    <script type="module" src="/main.js"></script>
</body>
</html>
291
web/main.js
Normal file
@@ -0,0 +1,291 @@
import {
    loadTextToSpeech,
    loadVoiceStyle,
    writeWavFile
} from './helper.js';

// Configuration
const DEFAULT_VOICE_STYLE_PATH = 'assets/voice_styles/M1.json';

// Helper function to extract filename from path
function getFilenameFromPath(path) {
    return path.split('/').pop();
}

// Global state
let textToSpeech = null;
let cfgs = null;

// Pre-computed style
let currentStyle = null;
let currentStylePath = DEFAULT_VOICE_STYLE_PATH;

// UI Elements
const textInput = document.getElementById('text');
const voiceStyleSelect = document.getElementById('voiceStyleSelect');
const voiceStyleInfo = document.getElementById('voiceStyleInfo');
const langSelect = document.getElementById('langSelect');
const totalStepInput = document.getElementById('totalStep');
const speedInput = document.getElementById('speed');
const generateBtn = document.getElementById('generateBtn');
const statusBox = document.getElementById('statusBox');
const statusText = document.getElementById('statusText');
const backendBadge = document.getElementById('backendBadge');
const resultsContainer = document.getElementById('results');
const errorBox = document.getElementById('error');

function showStatus(message, type = 'info') {
    statusText.innerHTML = message;
    statusBox.className = 'status-box';
    if (type === 'success') {
        statusBox.classList.add('success');
    } else if (type === 'error') {
        statusBox.classList.add('error');
    }
}

function showError(message) {
    errorBox.textContent = message;
    errorBox.classList.add('active');
}

function hideError() {
    errorBox.classList.remove('active');
}

function showBackendBadge() {
    backendBadge.classList.add('visible');
}

// Load voice style from JSON
async function loadStyleFromJSON(stylePath) {
    try {
        const style = await loadVoiceStyle([stylePath], true);
        return style;
    } catch (error) {
        console.error('Error loading voice style:', error);
        throw error;
    }
}

// Load models on page load
async function initializeModels() {
    try {
        showStatus('ℹ️ <strong>Loading configuration...</strong>');

        const basePath = 'assets/onnx';

        // Try WebGPU first, fall back to WASM
        let executionProvider = 'wasm';
        try {
            const result = await loadTextToSpeech(basePath, {
                executionProviders: ['webgpu'],
                graphOptimizationLevel: 'all'
            }, (modelName, current, total) => {
                showStatus(`ℹ️ <strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
            });

            textToSpeech = result.textToSpeech;
            cfgs = result.cfgs;

            executionProvider = 'webgpu';
            backendBadge.textContent = 'WebGPU';
            backendBadge.style.background = '#4caf50';
        } catch (webgpuError) {
            console.log('WebGPU not available, falling back to WebAssembly');

            const result = await loadTextToSpeech(basePath, {
                executionProviders: ['wasm'],
                graphOptimizationLevel: 'all'
            }, (modelName, current, total) => {
                showStatus(`ℹ️ <strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
            });

            textToSpeech = result.textToSpeech;
            cfgs = result.cfgs;
        }

        showStatus('ℹ️ <strong>Loading default voice style...</strong>');

        // Load default voice style
        currentStyle = await loadStyleFromJSON(currentStylePath);
        voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;

        showStatus(`✅ <strong>Models loaded!</strong> Using ${executionProvider.toUpperCase()}. You can now generate speech.`, 'success');
        showBackendBadge();

        generateBtn.disabled = false;
    } catch (error) {
        console.error('Error loading models:', error);
        showStatus(`❌ <strong>Error loading models:</strong> ${error.message}`, 'error');
    }
}

// Handle voice style selection
voiceStyleSelect.addEventListener('change', async (e) => {
    const selectedValue = e.target.value;

    if (!selectedValue) return;

    try {
        generateBtn.disabled = true;
        showStatus(`ℹ️ <strong>Loading voice style...</strong>`, 'info');

        currentStylePath = selectedValue;
        currentStyle = await loadStyleFromJSON(currentStylePath);
        voiceStyleInfo.textContent = getFilenameFromPath(currentStylePath);

        showStatus(`✅ <strong>Voice style loaded:</strong> ${getFilenameFromPath(currentStylePath)}`, 'success');
        generateBtn.disabled = false;
    } catch (error) {
        showError(`Error loading voice style: ${error.message}`);

        // Restore default style
        currentStylePath = DEFAULT_VOICE_STYLE_PATH;
        voiceStyleSelect.value = currentStylePath;
        try {
            currentStyle = await loadStyleFromJSON(currentStylePath);
            voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;
        } catch (styleError) {
            console.error('Error restoring default style:', styleError);
        }

        generateBtn.disabled = false;
    }
});

// Escape user-supplied text before it is interpolated into innerHTML,
// so markup characters in the input are rendered literally
function escapeHtml(s) {
    const div = document.createElement('div');
    div.textContent = s;
    return div.innerHTML;
}

// Main synthesis function
async function generateSpeech() {
    const text = textInput.value.trim();
    if (!text) {
        showError('Please enter some text to synthesize.');
        return;
    }

    if (!textToSpeech || !cfgs) {
        showError('Models are still loading. Please wait.');
        return;
    }

    if (!currentStyle) {
        showError('Voice style is not ready. Please wait.');
        return;
    }

    const startTime = Date.now();

    try {
        generateBtn.disabled = true;
        hideError();

        // Clear results and show placeholder
        resultsContainer.innerHTML = `
            <div class="results-placeholder generating">
                <div class="results-placeholder-icon">⏳</div>
                <p>Generating speech...</p>
            </div>
        `;

        const totalStep = parseInt(totalStepInput.value, 10);
        const speed = parseFloat(speedInput.value);
        const lang = langSelect.value;

        showStatus('ℹ️ <strong>Generating speech from text...</strong>');
        const tic = Date.now();

        const { wav, duration } = await textToSpeech.call(
            text,
            lang,
            currentStyle,
            totalStep,
            speed,
            0.3,
            (step, total) => {
                showStatus(`ℹ️ <strong>Denoising (${step}/${total})...</strong>`);
            }
        );

        const toc = Date.now();
        console.log(`Text-to-speech synthesis: ${((toc - tic) / 1000).toFixed(2)}s`);

        showStatus('ℹ️ <strong>Creating audio file...</strong>');
        const wavLen = Math.floor(textToSpeech.sampleRate * duration[0]);
        const wavOut = wav.slice(0, wavLen);

        // Create WAV file
        const wavBuffer = writeWavFile(wavOut, textToSpeech.sampleRate);
        const blob = new Blob([wavBuffer], { type: 'audio/wav' });
        const url = URL.createObjectURL(blob);

        // Calculate total time and audio duration
        const endTime = Date.now();
        const totalTimeSec = ((endTime - startTime) / 1000).toFixed(2);
        const audioDurationSec = duration[0].toFixed(2);

        // Display result with full text
        resultsContainer.innerHTML = `
            <div class="result-item">
                <div class="result-text-container">
                    <div class="result-text-label">Input Text</div>
                    <div class="result-text">${escapeHtml(text)}</div>
                </div>
                <div class="result-info">
                    <div class="info-item">
                        <span>📊 Audio Length</span>
                        <strong>${audioDurationSec}s</strong>
                    </div>
                    <div class="info-item">
                        <span>⏱️ Generation Time</span>
                        <strong>${totalTimeSec}s</strong>
                    </div>
                </div>
                <div class="result-player">
                    <audio controls>
                        <source src="${url}" type="audio/wav">
                    </audio>
                </div>
                <div class="result-actions">
                    <button onclick="downloadAudio('${url}', 'synthesized_speech.wav')">
                        <span>⬇️</span>
                        <span>Download WAV</span>
                    </button>
                </div>
            </div>
        `;

        showStatus('✅ <strong>Speech synthesis completed successfully!</strong>', 'success');
    } catch (error) {
        console.error('Error during synthesis:', error);
        showStatus(`❌ <strong>Error during synthesis:</strong> ${error.message}`, 'error');
        showError(`Error during synthesis: ${error.message}`);

        // Restore placeholder
        resultsContainer.innerHTML = `
            <div class="results-placeholder">
                <div class="results-placeholder-icon">🎤</div>
                <p>Generated speech will appear here</p>
            </div>
        `;
    } finally {
        generateBtn.disabled = false;
    }
}

// Download handler (made global so the inline onclick handler can call it)
window.downloadAudio = function(url, filename) {
    const a = document.createElement('a');
    a.href = url;
    a.download = filename;
    a.click();
};

// Attach generate function to button
generateBtn.addEventListener('click', generateSpeech);

// Initialize on load
window.addEventListener('load', async () => {
    generateBtn.disabled = true;
    await initializeModels();
});
21
web/package.json
Normal file
@@ -0,0 +1,21 @@
{
    "name": "tts-onnx-web",
    "version": "1.0.0",
    "description": "TTS inference using ONNX Runtime for Web Browser",
    "type": "module",
    "scripts": {
        "dev": "vite",
        "build": "vite build",
        "preview": "vite preview"
    },
    "keywords": ["tts", "onnx", "speech-synthesis", "web"],
    "author": "",
    "license": "MIT",
    "dependencies": {
        "fft.js": "^4.0.3",
        "onnxruntime-web": "^1.17.0"
    },
    "devDependencies": {
        "vite": "^5.0.0"
    }
}
453
web/style.css
Normal file
@@ -0,0 +1,453 @@
* {
    margin: 0;
    padding: 0;
    box-sizing: border-box;
}

body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    min-height: 100vh;
    display: flex;
    justify-content: center;
    align-items: center;
    padding: 20px;
}

.container {
    background: white;
    border-radius: 20px;
    padding: 40px;
    max-width: 1400px;
    width: 100%;
    box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
}

.main-content {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 40px;
    margin-top: 30px;
    align-items: start;
}

.left-panel {
    display: flex;
    flex-direction: column;
}

.right-panel {
    display: flex;
    flex-direction: column;
    height: 100%;
}

@media (max-width: 1024px) {
    .main-content {
        grid-template-columns: 1fr;
    }
}

h1 {
    color: #333;
    margin-bottom: 10px;
    font-size: 2em;
}

.subtitle {
    color: #666;
    margin-bottom: 30px;
    font-size: 1.1em;
}

.section {
    margin-bottom: 25px;
}

label {
    display: block;
    font-weight: 600;
    color: #333;
    margin-bottom: 8px;
    font-size: 0.95em;
}

input[type="file"],
textarea,
input[type="number"] {
    width: 100%;
    padding: 12px;
    border: 2px solid #e0e0e0;
    border-radius: 8px;
    font-size: 1em;
    transition: border-color 0.3s;
}

input[type="file"]:focus,
textarea:focus,
input[type="number"]:focus {
    outline: none;
    border-color: #667eea;
}

textarea {
    resize: vertical;
    min-height: 100px;
    font-family: inherit;
}

.params-grid {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 15px;
}

button {
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    border: none;
    padding: 15px 30px;
    font-size: 1.1em;
    font-weight: 600;
    border-radius: 8px;
    cursor: pointer;
    width: 100%;
    transition: transform 0.2s, box-shadow 0.2s;
}

button:hover:not(:disabled) {
    transform: translateY(-2px);
    box-shadow: 0 5px 20px rgba(102, 126, 234, 0.4);
}

button:disabled {
    opacity: 0.6;
    cursor: not-allowed;
}

.status-box {
    background: #e3f2fd;
    border-left: 4px solid #2196f3;
    padding: 15px;
    margin-bottom: 10px;
    border-radius: 4px;
    font-size: 0.9em;
    color: #1565c0;
    transition: all 0.3s ease;
    display: flex;
    justify-content: space-between;
    align-items: center;
    flex-wrap: wrap;
    gap: 15px;
    min-height: 50px;
}

.status-box.success {
    background: #e8f5e9;
    border-left-color: #4caf50;
    color: #2e7d32;
}

.status-box.error {
    background: #ffebee;
    border-left-color: #f44336;
    color: #c62828;
}

.status-text-wrapper {
    flex: 1;
    min-width: 200px;
}

.backend-badge {
    display: inline-block;
    visibility: hidden;
    padding: 6px 12px;
    background: #ff9800;
    color: white;
    border-radius: 12px;
    font-size: 0.85em;
    font-weight: 600;
    margin-left: 10px;
    white-space: nowrap;
}

.backend-badge.visible {
    visibility: visible;
}

.ref-audio-info {
    color: #4caf50;
    font-weight: 700;
    font-size: 0.95em;
}

.ref-audio-label {
    margin-bottom: 8px;
}

.ref-audio-label label {
    display: inline;
    margin-bottom: 0;
}

.results {
    flex: 1;
    display: flex;
    flex-direction: column;
}

.result-item {
    background: white;
    border-radius: 16px;
    box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
    overflow: hidden;
    transition: box-shadow 0.3s ease;
    display: flex;
    flex-direction: column;
    flex: 1;
}

.result-item:hover {
    box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
}

.result-item h3 {
    color: #667eea;
    margin-bottom: 15px;
    font-size: 1.2em;
}

.result-text-container {
    padding: 20px;
    background: linear-gradient(135deg, #f8f9ff 0%, #ffffff 100%);
    border-bottom: 1px solid #e8ecf5;
    flex: 1;
    display: flex;
    flex-direction: column;
    overflow: hidden;
}

.result-text-label {
    font-size: 0.75em;
    text-transform: uppercase;
    letter-spacing: 0.5px;
    color: #667eea;
    font-weight: 600;
    margin-bottom: 8px;
}

.result-text {
    color: #333;
    line-height: 1.7;
    font-size: 0.95em;
    word-wrap: break-word;
    white-space: pre-wrap;
    overflow-y: auto;
    padding-right: 8px;
    flex: 1;
}

.result-text::-webkit-scrollbar {
    width: 6px;
}

.result-text::-webkit-scrollbar-track {
    background: #f0f0f0;
    border-radius: 3px;
}

.result-text::-webkit-scrollbar-thumb {
    background: #c0c0c0;
    border-radius: 3px;
}

.result-text::-webkit-scrollbar-thumb:hover {
    background: #a0a0a0;
}

.result-info {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 0;
    background: #fafbff;
}

.info-item {
    padding: 16px 20px;
    display: flex;
    align-items: center;
    gap: 8px;
    font-size: 0.9em;
    color: #666;
    border-bottom: 1px solid #e8ecf5;
}

.info-item:nth-child(1) {
    border-right: 1px solid #e8ecf5;
}

.info-item strong {
    color: #333;
    font-size: 1.1em;
    font-weight: 600;
    margin-left: auto;
}

.result-player {
    padding: 20px;
    background: white;
}

.result-item audio {
    width: 100%;
    height: 48px;
    outline: none;
}

.result-item audio:focus {
    outline: 2px solid #667eea;
    outline-offset: 2px;
    border-radius: 4px;
}

.result-actions {
    padding: 16px 20px 20px;
    background: white;
}

.result-item button {
    width: 100%;
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    border: none;
    padding: 12px 24px;
    font-size: 0.95em;
    font-weight: 600;
    border-radius: 8px;
    cursor: pointer;
    transition: all 0.3s ease;
    display: flex;
    align-items: center;
    justify-content: center;
    gap: 8px;
}

.result-item button:hover {
    transform: translateY(-2px);
    box-shadow: 0 4px 16px rgba(102, 126, 234, 0.3);
}

.result-item button:active {
    transform: translateY(0);
}

@media (max-width: 640px) {
    .result-info {
        grid-template-columns: 1fr;
    }

    .info-item:nth-child(1) {
        border-right: none;
    }
}

audio {
    width: 100%;
    margin-top: 10px;
}

.error {
    background: #fee;
    color: #c00;
    padding: 15px;
    border-radius: 8px;
    margin-top: 20px;
    display: none;
}

.error.active {
    display: block;
}

.warning-box {
    background: #fff3cd;
    color: #856404;
    padding: 12px 15px;
    border-radius: 8px;
    margin-top: 10px;
    border-left: 4px solid #ffc107;
    font-size: 0.9em;
    display: none;
    line-height: 1.5;
}

.warning-box.active {
    display: block;
}

.warning-box::before {
    content: "⚠️ ";
    margin-right: 5px;
}

.results-placeholder {
    background: white;
    border-radius: 16px;
    box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
    padding: 60px 40px;
    text-align: center;
    color: #999;
    transition: all 0.3s ease;
    display: flex;
    flex-direction: column;
    justify-content: center;
    align-items: center;
    flex: 1;
    min-height: 400px;
}

.results-placeholder:hover {
    box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
}

.results-placeholder-icon {
    font-size: 4em;
    margin-bottom: 20px;
    opacity: 0.6;
    animation: float 3s ease-in-out infinite;
}

.results-placeholder.generating .results-placeholder-icon {
    animation: spin 2s linear infinite;
}

@keyframes float {
    0%, 100% {
        transform: translateY(0px);
    }
    50% {
        transform: translateY(-10px);
    }
}

@keyframes spin {
    0% {
        transform: rotate(0deg);
    }
    100% {
        transform: rotate(360deg);
    }
}

.results-placeholder p {
    font-size: 1.05em;
    color: #888;
    font-weight: 500;
    margin: 0;
}

.hidden {
    display: none;
}
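Assuming the standard Vite workflow implied by the `scripts` section of web/package.json above (the exact serving setup for the ONNX and voice-style assets is not shown here), the web demo can be run locally along these lines:

```shell
# From the web/ directory; assumes Node.js and npm are installed
npm install        # installs onnxruntime-web, fft.js, and vite
npm run dev        # starts the Vite dev server ("dev": "vite")
npm run build      # writes a production bundle ("build": "vite build")
npm run preview    # serves the production bundle ("preview": "vite preview")
```

Note that the `.gitignore` excludes `assets/*` and `*.onnx`, so the model files under `assets/onnx/` must be obtained separately before the page can load them.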