initial commit

web/.gitignore (vendored, new file, +4)
@@ -0,0 +1,4 @@
node_modules/
dist/
.DS_Store
*.log

web/README.md (new file, +121)
@@ -0,0 +1,121 @@
# Supertonic Web Example

This example demonstrates how to use Supertonic in a web browser using ONNX Runtime Web.

## 📰 Update News

**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)

**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.

**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) are now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic).

**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.

**2025.11.19** - Added a speed control slider to adjust speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).

**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.

## Features

- 🌐 Runs entirely in the browser (no server required for inference)
- 🚀 WebGPU support with automatic fallback to WebAssembly
- 🌍 Multilingual support: English (en), Korean (ko), Spanish (es), Portuguese (pt), French (fr)
- ⚡ Pre-extracted voice styles for instant generation
- 🎨 Modern, responsive UI
- 🎭 Multiple voice style presets (5 male, 5 female)
- 💾 Download generated audio as WAV files
- 📊 Detailed generation statistics (audio length, generation time)
- ⏱️ Real-time progress tracking

## Requirements

- Node.js (for the development server)
- A modern web browser (Chrome, Edge, Firefox, Safari)

## Installation

1. Install dependencies:

```bash
npm install
```

## Running the Demo

Start the development server:

```bash
npm run dev
```

This will start a local development server (usually at http://localhost:3000) and open the demo in your browser.

## Usage

1. **Wait for Models to Load**: The app automatically loads the models and the default voice style (M1)
2. **Select Voice Style**: Choose from the available voice presets
   - **Male 1-5 (M1-M5)**: Male voice styles
   - **Female 1-5 (F1-F5)**: Female voice styles
3. **Select Language**: Choose the language that matches your input text
   - **English (en)**: Default language
   - **한국어 (ko)**: Korean
   - **Español (es)**: Spanish
   - **Português (pt)**: Portuguese
   - **Français (fr)**: French
4. **Enter Text**: Type or paste the text you want to convert to speech
5. **Adjust Settings** (optional):
   - **Total Steps**: More steps = better quality but slower (default: 5)
   - **Speed**: Speech speed factor (default: 1.05, recommended range: 0.9-1.5)
6. **Generate Speech**: Click the "Generate Speech" button
7. **View Results**:
   - See the full input text
   - View audio length and generation time statistics
   - Play the generated audio in the browser
   - Download the result as a WAV file
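Long inputs are not synthesized in a single pass: the helper splits text into sentence-sized chunks (about 300 characters for English, 120 for Korean) and joins the audio with short pauses. A simplified sketch of that chunking logic (the actual `chunkText` in `web/helper.js` also avoids splitting on abbreviations such as "Dr."):

```javascript
// Simplified sentence-based chunker: greedily packs sentences into
// chunks no longer than maxLen characters.
function chunkTextSimple(text, maxLen = 300) {
  // Split on whitespace that follows sentence-ending punctuation.
  const sentences = text.trim().split(/(?<=[.!?])\s+/);
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    if (current.length + sentence.length + 1 <= maxLen) {
      current += (current ? ' ' : '') + sentence;
    } else {
      if (current) chunks.push(current);
      current = sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```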
## Multilingual Support

Supertonic 2 supports multiple languages. Make sure to select the language that matches your input text to get the best results. The model automatically handles text preprocessing and pronunciation for the selected language.
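Internally, the text processor validates the language code and wraps the normalized text in language tags before tokenization. A minimal sketch, mirroring `isValidLang` and the tag-wrapping step in `web/helper.js`:

```javascript
const AVAILABLE_LANGS = ['en', 'ko', 'es', 'pt', 'fr'];

// Validate the language code, then wrap the text in <lang>...</lang>
// tags the way the text processor does before tokenization.
function wrapWithLangTag(text, lang) {
  if (!AVAILABLE_LANGS.includes(lang)) {
    throw new Error(`Invalid language: ${lang}. Available: ${AVAILABLE_LANGS.join(', ')}`);
  }
  return `<${lang}>${text}</${lang}>`;
}
```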
## Technical Details

### Browser Compatibility

This demo uses:

- **ONNX Runtime Web**: For running the models in the browser
- **Web Audio API**: For playing the generated audio
- **Vite**: For development and bundling
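The demo's loading code tries WebGPU first and retries with WebAssembly if session creation fails. That pattern can be sketched as a small helper; here the session factory is injected as a callback (an assumption for illustration, so the fallback logic stands apart from ONNX Runtime itself):

```javascript
// Try each execution provider in order and return the first session
// that can be created, mirroring the WebGPU -> WASM fallback in web/main.js.
async function createSessionWithFallback(createSession, providers = ['webgpu', 'wasm']) {
  let lastError = null;
  for (const provider of providers) {
    try {
      const session = await createSession({ executionProviders: [provider] });
      return { session, provider };
    } catch (error) {
      lastError = error; // remember the failure and try the next provider
    }
  }
  throw lastError;
}
```

In the real demo, `createSession` would wrap `ort.InferenceSession.create` with the chosen options.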
## Notes

- The ONNX models must be accessible at `assets/onnx/` relative to the web root
- Voice style JSON files must be accessible at `assets/voice_styles/` relative to the web root
- Pre-extracted voice styles enable instant generation without audio processing
- Ten voice style presets are provided (M1-M5, F1-F5)
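The WAV download is produced entirely client-side: float samples in [-1, 1] are clamped and scaled to 16-bit PCM behind a 44-byte RIFF header (see `writeWavFile` in `web/helper.js`). The sample-conversion step on its own:

```javascript
// Convert float samples in [-1, 1] to 16-bit PCM integers, as done
// before the RIFF/WAVE header is written in writeWavFile.
function floatTo16BitPCM(audioData) {
  const int16Data = new Int16Array(audioData.length);
  for (let i = 0; i < audioData.length; i++) {
    // Clamp out-of-range samples, then scale to the int16 range.
    const clamped = Math.max(-1.0, Math.min(1.0, audioData[i]));
    int16Data[i] = Math.floor(clamped * 32767);
  }
  return int16Data;
}
```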
## Troubleshooting

### Models not loading

- Check the browser console for errors
- Ensure the `assets/onnx/` path is correct and the models are accessible
- Check CORS settings if serving from a different domain

### WebGPU not available

- WebGPU is only available in recent Chrome/Edge browsers (version 113+)
- The app automatically falls back to WebAssembly if WebGPU is not available
- Check the backend badge to see which execution provider is being used

### Out of memory errors

- Try shorter text inputs
- Reduce the number of denoising steps (the "Total Steps" setting)
- Use a browser with more available memory
- Close other tabs to free up memory

### Audio quality issues

- Try different voice style presets
- Increase the number of denoising steps for better quality

### Slow generation

- If using WebAssembly, try a browser that supports WebGPU
- Ensure no other heavy processes are running
- Consider using fewer denoising steps for faster (but lower-quality) results

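To check up front whether a browser can use the WebGPU execution provider, you can test for the `navigator.gpu` object. Written here with the navigator passed in as a parameter (an illustrative choice, so the check also works outside a browser):

```javascript
// Returns true when the given navigator-like object exposes the WebGPU
// API; in that case ONNX Runtime Web can try the 'webgpu' provider.
function hasWebGPU(nav) {
  return !!(nav && 'gpu' in nav && nav.gpu);
}
```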
web/helper.js (new file, +561)
@@ -0,0 +1,561 @@
import * as ort from 'onnxruntime-web';

// Available languages for multilingual TTS
export const AVAILABLE_LANGS = ['en', 'ko', 'es', 'pt', 'fr'];

export function isValidLang(lang) {
  return AVAILABLE_LANGS.includes(lang);
}

/**
 * Unicode Text Processor
 */
export class UnicodeProcessor {
  constructor(indexer) {
    this.indexer = indexer;
  }

  call(textList, langList) {
    const processedTexts = textList.map((text, i) => this.preprocessText(text, langList[i]));

    const textIdsLengths = processedTexts.map(text => text.length);
    const maxLen = Math.max(...textIdsLengths);

    const textIds = processedTexts.map(text => {
      const row = new Array(maxLen).fill(0);
      for (let j = 0; j < text.length; j++) {
        const codePoint = text.codePointAt(j);
        row[j] = (codePoint < this.indexer.length) ? this.indexer[codePoint] : -1;
      }
      return row;
    });

    const textMask = this.getTextMask(textIdsLengths);
    return { textIds, textMask };
  }
  preprocessText(text, lang) {
    // TODO: Need an advanced normalizer for better performance
    text = text.normalize('NFKD');

    // Remove emojis (wide Unicode range)
    const emojiPattern = /[\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F680}-\u{1F6FF}\u{1F700}-\u{1F77F}\u{1F780}-\u{1F7FF}\u{1F800}-\u{1F8FF}\u{1F900}-\u{1F9FF}\u{1FA00}-\u{1FA6F}\u{1FA70}-\u{1FAFF}\u{2600}-\u{26FF}\u{2700}-\u{27BF}\u{1F1E6}-\u{1F1FF}]+/gu;
    text = text.replace(emojiPattern, '');

    // Replace various dashes and symbols
    const replacements = {
      '–': '-',
      '‑': '-',
      '—': '-',
      '_': ' ',
      '\u201C': '"', // left double quote
      '\u201D': '"', // right double quote
      '\u2018': "'", // left single quote
      '\u2019': "'", // right single quote
      '´': "'",
      '`': "'",
      '[': ' ',
      ']': ' ',
      '|': ' ',
      '/': ' ',
      '#': ' ',
      '→': ' ',
      '←': ' ',
    };
    for (const [k, v] of Object.entries(replacements)) {
      text = text.replaceAll(k, v);
    }

    // Remove special symbols
    text = text.replace(/[♥☆♡©\\]/g, '');

    // Replace known expressions
    const exprReplacements = {
      '@': ' at ',
      'e.g.,': 'for example, ',
      'i.e.,': 'that is, ',
    };
    for (const [k, v] of Object.entries(exprReplacements)) {
      text = text.replaceAll(k, v);
    }

    // Fix spacing around punctuation
    text = text.replace(/ ,/g, ',');
    text = text.replace(/ \./g, '.');
    text = text.replace(/ !/g, '!');
    text = text.replace(/ \?/g, '?');
    text = text.replace(/ ;/g, ';');
    text = text.replace(/ :/g, ':');
    text = text.replace(/ '/g, "'");

    // Remove duplicate quotes
    while (text.includes('""')) {
      text = text.replace('""', '"');
    }
    while (text.includes("''")) {
      text = text.replace("''", "'");
    }
    while (text.includes('``')) {
      text = text.replace('``', '`');
    }

    // Remove extra spaces
    text = text.replace(/\s+/g, ' ').trim();

    // If text doesn't end with punctuation, quotes, or closing brackets, add a period
    if (!/[.!?;:,'\"')\]}…。」』】〉》›»]$/.test(text)) {
      text += '.';
    }

    // Validate language
    if (!isValidLang(lang)) {
      throw new Error(`Invalid language: ${lang}. Available: ${AVAILABLE_LANGS.join(', ')}`);
    }

    // Wrap text with language tags
    text = `<${lang}>${text}</${lang}>`;

    return text;
  }
  getTextMask(textIdsLengths) {
    const maxLen = Math.max(...textIdsLengths);
    return this.lengthToMask(textIdsLengths, maxLen);
  }

  lengthToMask(lengths, maxLen = null) {
    const actualMaxLen = maxLen || Math.max(...lengths);
    return lengths.map(len => {
      const row = new Array(actualMaxLen).fill(0.0);
      for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
        row[j] = 1.0;
      }
      return [row];
    });
  }
}

/**
 * Style class to hold TTL and DP tensors
 */
export class Style {
  constructor(ttlTensor, dpTensor) {
    this.ttl = ttlTensor;
    this.dp = dpTensor;
  }
}
/**
 * Text-to-Speech class
 */
export class TextToSpeech {
  constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
    this.cfgs = cfgs;
    this.textProcessor = textProcessor;
    this.dpOrt = dpOrt;
    this.textEncOrt = textEncOrt;
    this.vectorEstOrt = vectorEstOrt;
    this.vocoderOrt = vocoderOrt;
    this.sampleRate = cfgs.ae.sample_rate;
  }

  async _infer(textList, langList, style, totalStep, speed = 1.05, progressCallback = null) {
    const bsz = textList.length;

    // Process text
    const { textIds, textMask } = this.textProcessor.call(textList, langList);

    const textIdsFlat = new BigInt64Array(textIds.flat().map(x => BigInt(x)));
    const textIdsShape = [bsz, textIds[0].length];
    const textIdsTensor = new ort.Tensor('int64', textIdsFlat, textIdsShape);

    const textMaskFlat = new Float32Array(textMask.flat(2));
    const textMaskShape = [bsz, 1, textMask[0][0].length];
    const textMaskTensor = new ort.Tensor('float32', textMaskFlat, textMaskShape);

    // Predict duration
    const dpOutputs = await this.dpOrt.run({
      text_ids: textIdsTensor,
      style_dp: style.dp,
      text_mask: textMaskTensor
    });
    const duration = Array.from(dpOutputs.duration.data);

    // Apply speed factor to duration
    for (let i = 0; i < duration.length; i++) {
      duration[i] /= speed;
    }

    // Encode text
    const textEncOutputs = await this.textEncOrt.run({
      text_ids: textIdsTensor,
      style_ttl: style.ttl,
      text_mask: textMaskTensor
    });
    const textEmb = textEncOutputs.text_emb;

    // Sample noisy latent
    let { xt, latentMask } = this.sampleNoisyLatent(
      duration,
      this.sampleRate,
      this.cfgs.ae.base_chunk_size,
      this.cfgs.ttl.chunk_compress_factor,
      this.cfgs.ttl.latent_dim
    );

    const latentMaskFlat = new Float32Array(latentMask.flat(2));
    const latentMaskShape = [bsz, 1, latentMask[0][0].length];
    const latentMaskTensor = new ort.Tensor('float32', latentMaskFlat, latentMaskShape);

    // Prepare constant arrays
    const totalStepArray = new Float32Array(bsz).fill(totalStep);
    const totalStepTensor = new ort.Tensor('float32', totalStepArray, [bsz]);

    // Denoising loop
    for (let step = 0; step < totalStep; step++) {
      if (progressCallback) {
        progressCallback(step + 1, totalStep);
      }

      const currentStepArray = new Float32Array(bsz).fill(step);
      const currentStepTensor = new ort.Tensor('float32', currentStepArray, [bsz]);

      const xtFlat = new Float32Array(xt.flat(2));
      const xtShape = [bsz, xt[0].length, xt[0][0].length];
      const xtTensor = new ort.Tensor('float32', xtFlat, xtShape);

      const vectorEstOutputs = await this.vectorEstOrt.run({
        noisy_latent: xtTensor,
        text_emb: textEmb,
        style_ttl: style.ttl,
        latent_mask: latentMaskTensor,
        text_mask: textMaskTensor,
        current_step: currentStepTensor,
        total_step: totalStepTensor
      });

      const denoised = Array.from(vectorEstOutputs.denoised_latent.data);

      // Reshape to 3D
      const latentDim = xt[0].length;
      const latentLen = xt[0][0].length;
      xt = [];
      let idx = 0;
      for (let b = 0; b < bsz; b++) {
        const batch = [];
        for (let d = 0; d < latentDim; d++) {
          const row = [];
          for (let t = 0; t < latentLen; t++) {
            row.push(denoised[idx++]);
          }
          batch.push(row);
        }
        xt.push(batch);
      }
    }

    // Generate waveform
    const finalXtFlat = new Float32Array(xt.flat(2));
    const finalXtShape = [bsz, xt[0].length, xt[0][0].length];
    const finalXtTensor = new ort.Tensor('float32', finalXtFlat, finalXtShape);

    const vocoderOutputs = await this.vocoderOrt.run({
      latent: finalXtTensor
    });

    const wav = Array.from(vocoderOutputs.wav_tts.data);

    return { wav, duration };
  }
  async call(text, lang, style, totalStep, speed = 1.05, silenceDuration = 0.3, progressCallback = null) {
    if (style.ttl.dims[0] !== 1) {
      throw new Error('Single speaker text to speech only supports single style');
    }
    const maxLen = lang === 'ko' ? 120 : 300;
    const textList = chunkText(text, maxLen);
    const langList = new Array(textList.length).fill(lang);
    let wavCat = [];
    let durCat = 0;

    for (let i = 0; i < textList.length; i++) {
      const { wav, duration } = await this._infer([textList[i]], [langList[i]], style, totalStep, speed, progressCallback);

      if (wavCat.length === 0) {
        wavCat = wav;
        durCat = duration[0];
      } else {
        const silenceLen = Math.floor(silenceDuration * this.sampleRate);
        const silence = new Array(silenceLen).fill(0);
        wavCat = [...wavCat, ...silence, ...wav];
        durCat += duration[0] + silenceDuration;
      }
    }

    return { wav: wavCat, duration: [durCat] };
  }

  async batch(textList, langList, style, totalStep, speed = 1.05, progressCallback = null) {
    return await this._infer(textList, langList, style, totalStep, speed, progressCallback);
  }

  sampleNoisyLatent(duration, sampleRate, baseChunkSize, chunkCompress, latentDim) {
    const bsz = duration.length;
    const maxDur = Math.max(...duration);

    const wavLenMax = Math.floor(maxDur * sampleRate);
    const wavLengths = duration.map(d => Math.floor(d * sampleRate));

    const chunkSize = baseChunkSize * chunkCompress;
    const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
    const latentDimVal = latentDim * chunkCompress;

    const xt = [];
    for (let b = 0; b < bsz; b++) {
      const batch = [];
      for (let d = 0; d < latentDimVal; d++) {
        const row = [];
        for (let t = 0; t < latentLen; t++) {
          // Box-Muller transform
          const u1 = Math.max(0.0001, Math.random());
          const u2 = Math.random();
          const val = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
          row.push(val);
        }
        batch.push(row);
      }
      xt.push(batch);
    }

    const latentLengths = wavLengths.map(len => Math.floor((len + chunkSize - 1) / chunkSize));
    const latentMask = this.lengthToMask(latentLengths, latentLen);

    // Apply mask
    for (let b = 0; b < bsz; b++) {
      for (let d = 0; d < latentDimVal; d++) {
        for (let t = 0; t < latentLen; t++) {
          xt[b][d][t] *= latentMask[b][0][t];
        }
      }
    }

    return { xt, latentMask };
  }

  lengthToMask(lengths, maxLen = null) {
    const actualMaxLen = maxLen || Math.max(...lengths);
    return lengths.map(len => {
      const row = new Array(actualMaxLen).fill(0.0);
      for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
        row[j] = 1.0;
      }
      return [row];
    });
  }
}
/**
 * Load voice style from JSON files
 */
export async function loadVoiceStyle(voiceStylePaths, verbose = false) {
  const bsz = voiceStylePaths.length;

  // Read first file to get dimensions
  const firstResponse = await fetch(voiceStylePaths[0]);
  const firstStyle = await firstResponse.json();

  const ttlDims = firstStyle.style_ttl.dims;
  const dpDims = firstStyle.style_dp.dims;

  const ttlDim1 = ttlDims[1];
  const ttlDim2 = ttlDims[2];
  const dpDim1 = dpDims[1];
  const dpDim2 = dpDims[2];

  // Pre-allocate arrays with full batch size
  const ttlSize = bsz * ttlDim1 * ttlDim2;
  const dpSize = bsz * dpDim1 * dpDim2;
  const ttlFlat = new Float32Array(ttlSize);
  const dpFlat = new Float32Array(dpSize);

  // Fill in the data
  for (let i = 0; i < bsz; i++) {
    const response = await fetch(voiceStylePaths[i]);
    const voiceStyle = await response.json();

    // Flatten TTL data
    const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
    const ttlOffset = i * ttlDim1 * ttlDim2;
    ttlFlat.set(ttlData, ttlOffset);

    // Flatten DP data
    const dpData = voiceStyle.style_dp.data.flat(Infinity);
    const dpOffset = i * dpDim1 * dpDim2;
    dpFlat.set(dpData, dpOffset);
  }

  const ttlShape = [bsz, ttlDim1, ttlDim2];
  const dpShape = [bsz, dpDim1, dpDim2];

  const ttlTensor = new ort.Tensor('float32', ttlFlat, ttlShape);
  const dpTensor = new ort.Tensor('float32', dpFlat, dpShape);

  if (verbose) {
    console.log(`Loaded ${bsz} voice styles`);
  }

  return new Style(ttlTensor, dpTensor);
}

/**
 * Load configuration from JSON
 */
export async function loadCfgs(onnxDir) {
  const response = await fetch(`${onnxDir}/tts.json`);
  const cfgs = await response.json();
  return cfgs;
}

/**
 * Load text processor
 */
export async function loadTextProcessor(onnxDir) {
  const response = await fetch(`${onnxDir}/unicode_indexer.json`);
  const indexer = await response.json();
  return new UnicodeProcessor(indexer);
}

/**
 * Load ONNX model
 */
export async function loadOnnx(onnxPath, options) {
  const session = await ort.InferenceSession.create(onnxPath, options);
  return session;
}
/**
 * Load all TTS components
 */
export async function loadTextToSpeech(onnxDir, sessionOptions = {}, progressCallback = null) {
  console.log('Using WebAssembly/WebGPU for inference');

  const cfgs = await loadCfgs(onnxDir);

  const dpPath = `${onnxDir}/duration_predictor.onnx`;
  const textEncPath = `${onnxDir}/text_encoder.onnx`;
  const vectorEstPath = `${onnxDir}/vector_estimator.onnx`;
  const vocoderPath = `${onnxDir}/vocoder.onnx`;

  const modelPaths = [
    { name: 'Duration Predictor', path: dpPath },
    { name: 'Text Encoder', path: textEncPath },
    { name: 'Vector Estimator', path: vectorEstPath },
    { name: 'Vocoder', path: vocoderPath }
  ];

  const sessions = [];
  for (let i = 0; i < modelPaths.length; i++) {
    if (progressCallback) {
      progressCallback(modelPaths[i].name, i + 1, modelPaths.length);
    }
    const session = await loadOnnx(modelPaths[i].path, sessionOptions);
    sessions.push(session);
  }

  const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = sessions;

  const textProcessor = await loadTextProcessor(onnxDir);
  const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);

  return { textToSpeech, cfgs };
}

/**
 * Chunk text into manageable segments
 */
function chunkText(text, maxLen = 300) {
  if (typeof text !== 'string') {
    throw new Error(`chunkText expects a string, got ${typeof text}`);
  }

  // Split by paragraph (two or more newlines)
  const paragraphs = text.trim().split(/\n\s*\n+/).filter(p => p.trim());

  const chunks = [];

  for (let paragraph of paragraphs) {
    paragraph = paragraph.trim();
    if (!paragraph) continue;

    // Split on sentence boundaries (period, question mark, exclamation mark followed by a space),
    // but exclude common abbreviations like Mr., Mrs., Dr., etc. and single capital letters like F.
    const sentences = paragraph.split(/(?<!Mr\.|Mrs\.|Ms\.|Dr\.|Prof\.|Sr\.|Jr\.|Ph\.D\.|etc\.|e\.g\.|i\.e\.|vs\.|Inc\.|Ltd\.|Co\.|Corp\.|St\.|Ave\.|Blvd\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+/);

    let currentChunk = "";

    for (let sentence of sentences) {
      if (currentChunk.length + sentence.length + 1 <= maxLen) {
        currentChunk += (currentChunk ? " " : "") + sentence;
      } else {
        if (currentChunk) {
          chunks.push(currentChunk.trim());
        }
        currentChunk = sentence;
      }
    }

    if (currentChunk) {
      chunks.push(currentChunk.trim());
    }
  }

  return chunks;
}
/**
 * Write WAV file to ArrayBuffer
 */
export function writeWavFile(audioData, sampleRate) {
  const numChannels = 1;
  const bitsPerSample = 16;
  const byteRate = sampleRate * numChannels * bitsPerSample / 8;
  const blockAlign = numChannels * bitsPerSample / 8;
  const dataSize = audioData.length * 2;

  // Create ArrayBuffer
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  // Write WAV header
  const writeString = (offset, string) => {
    for (let i = 0; i < string.length; i++) {
      view.setUint8(offset + i, string.charCodeAt(i));
    }
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true);
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);
  view.setUint16(20, 1, true); // PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeString(36, 'data');
  view.setUint32(40, dataSize, true);

  // Write audio data
  const int16Data = new Int16Array(audioData.length);
  for (let i = 0; i < audioData.length; i++) {
    const clamped = Math.max(-1.0, Math.min(1.0, audioData[i]));
    int16Data[i] = Math.floor(clamped * 32767);
  }

  const dataView = new Uint8Array(buffer, 44);
  dataView.set(new Uint8Array(int16Data.buffer));

  return buffer;
}
web/index.html (new file, +95)
@@ -0,0 +1,95 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Supertonic - Web Demo</title>
    <link rel="stylesheet" href="/style.css">
</head>
<body>
    <div class="container">
        <h1>🎤 Supertonic 2</h1>
        <p class="subtitle">Multilingual Text-to-Speech with ONNX Runtime Web</p>

        <div id="statusBox" class="status-box">
            <div class="status-text-wrapper">
                <div id="statusText">ℹ️ <strong>Loading models...</strong> Please wait...</div>
            </div>
            <div id="backendBadge" class="backend-badge">WebAssembly</div>
        </div>

        <div class="main-content">
            <div class="left-panel">
                <div class="section">
                    <div class="ref-audio-label">
                        <label for="voiceStyleSelect">Voice Style: </label>
                        <span id="voiceStyleInfo" class="ref-audio-info">Loading...</span>
                    </div>
                    <select id="voiceStyleSelect">
                        <option value="assets/voice_styles/M1.json">Male 1 (M1)</option>
                        <option value="assets/voice_styles/M2.json">Male 2 (M2)</option>
                        <option value="assets/voice_styles/M3.json">Male 3 (M3)</option>
                        <option value="assets/voice_styles/M4.json">Male 4 (M4)</option>
                        <option value="assets/voice_styles/M5.json">Male 5 (M5)</option>
                        <option value="assets/voice_styles/F1.json">Female 1 (F1)</option>
                        <option value="assets/voice_styles/F2.json">Female 2 (F2)</option>
                        <option value="assets/voice_styles/F3.json">Female 3 (F3)</option>
                        <option value="assets/voice_styles/F4.json">Female 4 (F4)</option>
                        <option value="assets/voice_styles/F5.json">Female 5 (F5)</option>
                    </select>
                </div>

                <div class="section">
                    <label for="langSelect">Language:</label>
                    <select id="langSelect">
                        <option value="en" selected>English (en)</option>
                        <option value="ko">한국어 (ko)</option>
                        <option value="es">Español (es)</option>
                        <option value="pt">Português (pt)</option>
                        <option value="fr">Français (fr)</option>
                    </select>
                </div>

                <div class="section">
                    <label for="text">Text to Synthesize:</label>
                    <textarea id="text" placeholder="Enter the text you want to convert to speech...">This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.</textarea>
                </div>

                <div class="params-grid">
                    <div class="section">
                        <label for="totalStep">Total Steps (higher = better quality):</label>
                        <input type="number" id="totalStep" value="5" min="1" max="50">
                    </div>

                    <div class="section">
                        <label for="speed">Speed (0.9-1.5 recommended):</label>
                        <input type="number" id="speed" value="1.05" min="0.5" max="2.0" step="0.05">
                    </div>
                </div>

                <button id="generateBtn">Generate Speech</button>

                <div id="error" class="error"></div>
            </div>

            <div class="right-panel">
                <div id="results" class="results">
                    <div class="results-placeholder">
                        <div class="results-placeholder-icon">🎤</div>
                        <p>Generated speech will appear here</p>
                    </div>
                </div>
            </div>
        </div>
    </div>

    <script type="module" src="/main.js"></script>
</body>
</html>
web/main.js (new file, +291)
@@ -0,0 +1,291 @@
import {
  loadTextToSpeech,
  loadVoiceStyle,
  writeWavFile
} from './helper.js';

// Configuration
const DEFAULT_VOICE_STYLE_PATH = 'assets/voice_styles/M1.json';

// Helper function to extract the filename from a path
function getFilenameFromPath(path) {
  return path.split('/').pop();
}

// Global state
let textToSpeech = null;
let cfgs = null;

// Pre-computed style
let currentStyle = null;
let currentStylePath = DEFAULT_VOICE_STYLE_PATH;

// UI Elements
const textInput = document.getElementById('text');
const voiceStyleSelect = document.getElementById('voiceStyleSelect');
const voiceStyleInfo = document.getElementById('voiceStyleInfo');
const langSelect = document.getElementById('langSelect');
const totalStepInput = document.getElementById('totalStep');
const speedInput = document.getElementById('speed');
const generateBtn = document.getElementById('generateBtn');
const statusBox = document.getElementById('statusBox');
const statusText = document.getElementById('statusText');
const backendBadge = document.getElementById('backendBadge');
const resultsContainer = document.getElementById('results');
const errorBox = document.getElementById('error');

function showStatus(message, type = 'info') {
  statusText.innerHTML = message;
  statusBox.className = 'status-box';
  if (type === 'success') {
    statusBox.classList.add('success');
  } else if (type === 'error') {
    statusBox.classList.add('error');
  }
}

function showError(message) {
  errorBox.textContent = message;
  errorBox.classList.add('active');
}

function hideError() {
  errorBox.classList.remove('active');
}

function showBackendBadge() {
  backendBadge.classList.add('visible');
}

// Load voice style from JSON
async function loadStyleFromJSON(stylePath) {
  try {
    const style = await loadVoiceStyle([stylePath], true);
    return style;
  } catch (error) {
    console.error('Error loading voice style:', error);
    throw error;
  }
}
// Load models on page load
async function initializeModels() {
  try {
    showStatus('ℹ️ <strong>Loading configuration...</strong>');

    const basePath = 'assets/onnx';

    // Try WebGPU first, fall back to WASM
    let executionProvider = 'wasm';
    try {
      const result = await loadTextToSpeech(basePath, {
        executionProviders: ['webgpu'],
        graphOptimizationLevel: 'all'
      }, (modelName, current, total) => {
        showStatus(`ℹ️ <strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
      });

      textToSpeech = result.textToSpeech;
      cfgs = result.cfgs;

      executionProvider = 'webgpu';
      backendBadge.textContent = 'WebGPU';
      backendBadge.style.background = '#4caf50';
    } catch (webgpuError) {
      console.warn('WebGPU not available, falling back to WebAssembly:', webgpuError);

      const result = await loadTextToSpeech(basePath, {
        executionProviders: ['wasm'],
        graphOptimizationLevel: 'all'
      }, (modelName, current, total) => {
        showStatus(`ℹ️ <strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
      });

      textToSpeech = result.textToSpeech;
      cfgs = result.cfgs;
    }

    showStatus('ℹ️ <strong>Loading default voice style...</strong>');

    // Load default voice style
    currentStyle = await loadStyleFromJSON(currentStylePath);
    voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;

    showStatus(`✅ <strong>Models loaded!</strong> Using ${executionProvider.toUpperCase()}. You can now generate speech.`, 'success');
    showBackendBadge();

    generateBtn.disabled = false;

  } catch (error) {
    console.error('Error loading models:', error);
    showStatus(`❌ <strong>Error loading models:</strong> ${error.message}`, 'error');
  }
}

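The loader above discovers WebGPU support by attempting a session and catching the failure. A cheaper pre-check (an assumption on my part, not part of this example) is to probe `navigator.gpu`, the standard WebGPU entry point, before choosing execution providers:

```javascript
// Sketch: choose ONNX Runtime Web execution providers by probing for
// WebGPU support first. `navigator.gpu` is the standard WebGPU entry
// point; environments without it (or without `navigator`) get WASM only.
function pickExecutionProviders() {
  const hasWebGPU = typeof navigator !== 'undefined' && 'gpu' in navigator;
  // Listing WASM after WebGPU keeps it as an in-order fallback.
  return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}
```

Note that `navigator.gpu` being present does not guarantee an adapter is actually available, so the try/catch fallback is still worth keeping even with a probe like this.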
// Handle voice style selection
voiceStyleSelect.addEventListener('change', async (e) => {
  const selectedValue = e.target.value;

  if (!selectedValue) return;

  try {
    generateBtn.disabled = true;
    showStatus(`ℹ️ <strong>Loading voice style...</strong>`, 'info');

    currentStylePath = selectedValue;
    currentStyle = await loadStyleFromJSON(currentStylePath);
    voiceStyleInfo.textContent = getFilenameFromPath(currentStylePath);

    showStatus(`✅ <strong>Voice style loaded:</strong> ${getFilenameFromPath(currentStylePath)}`, 'success');
    generateBtn.disabled = false;
  } catch (error) {
    showError(`Error loading voice style: ${error.message}`);

    // Restore default style
    currentStylePath = DEFAULT_VOICE_STYLE_PATH;
    voiceStyleSelect.value = currentStylePath;
    try {
      currentStyle = await loadStyleFromJSON(currentStylePath);
      voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;
    } catch (styleError) {
      console.error('Error restoring default style:', styleError);
    }

    generateBtn.disabled = false;
  }
});

// Main synthesis function
async function generateSpeech() {
  const text = textInput.value.trim();
  if (!text) {
    showError('Please enter some text to synthesize.');
    return;
  }

  if (!textToSpeech || !cfgs) {
    showError('Models are still loading. Please wait.');
    return;
  }

  if (!currentStyle) {
    showError('Voice style is not ready. Please wait.');
    return;
  }

  const startTime = Date.now();

  try {
    generateBtn.disabled = true;
    hideError();

    // Clear results and show placeholder
    resultsContainer.innerHTML = `
      <div class="results-placeholder generating">
        <div class="results-placeholder-icon">⏳</div>
        <p>Generating speech...</p>
      </div>
    `;

    const totalStep = parseInt(totalStepInput.value, 10);
    const speed = parseFloat(speedInput.value);
    const lang = langSelect.value;

    showStatus('ℹ️ <strong>Generating speech from text...</strong>');
    const tic = Date.now();

    const { wav, duration } = await textToSpeech.call(
      text,
      lang,
      currentStyle,
      totalStep,
      speed,
      0.3,
      (step, total) => {
        showStatus(`ℹ️ <strong>Denoising (${step}/${total})...</strong>`);
      }
    );

    const toc = Date.now();
    console.log(`Text-to-speech synthesis: ${((toc - tic) / 1000).toFixed(2)}s`);

    showStatus('ℹ️ <strong>Creating audio file...</strong>');
    const wavLen = Math.floor(textToSpeech.sampleRate * duration[0]);
    const wavOut = wav.slice(0, wavLen);

    // Create WAV file
    const wavBuffer = writeWavFile(wavOut, textToSpeech.sampleRate);
    const blob = new Blob([wavBuffer], { type: 'audio/wav' });
    const url = URL.createObjectURL(blob);

    // Calculate total time and audio duration
    const endTime = Date.now();
    const totalTimeSec = ((endTime - startTime) / 1000).toFixed(2);
    const audioDurationSec = duration[0].toFixed(2);

    // Escape the user's text so markup characters render literally
    // instead of being parsed when injected into innerHTML
    const safeText = text
      .replace(/&/g, '&amp;')
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;');

    // Display result with full text
    resultsContainer.innerHTML = `
      <div class="result-item">
        <div class="result-text-container">
          <div class="result-text-label">Input Text</div>
          <div class="result-text">${safeText}</div>
        </div>
        <div class="result-info">
          <div class="info-item">
            <span>📊 Audio Length</span>
            <strong>${audioDurationSec}s</strong>
          </div>
          <div class="info-item">
            <span>⏱️ Generation Time</span>
            <strong>${totalTimeSec}s</strong>
          </div>
        </div>
        <div class="result-player">
          <audio controls>
            <source src="${url}" type="audio/wav">
          </audio>
        </div>
        <div class="result-actions">
          <button onclick="downloadAudio('${url}', 'synthesized_speech.wav')">
            <span>⬇️</span>
            <span>Download WAV</span>
          </button>
        </div>
      </div>
    `;

    showStatus('✅ <strong>Speech synthesis completed successfully!</strong>', 'success');

  } catch (error) {
    console.error('Error during synthesis:', error);
    showStatus(`❌ <strong>Error during synthesis:</strong> ${error.message}`, 'error');
    showError(`Error during synthesis: ${error.message}`);

    // Restore placeholder
    resultsContainer.innerHTML = `
      <div class="results-placeholder">
        <div class="results-placeholder-icon">🎤</div>
        <p>Generated speech will appear here</p>
      </div>
    `;
  } finally {
    generateBtn.disabled = false;
  }
}

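For context, `writeWavFile` (defined elsewhere in this example, not shown in this chunk) packs the trimmed Float32 samples into a 16-bit PCM WAV buffer for the `Blob`. A minimal sketch of such a writer, following the standard 44-byte RIFF/WAVE header layout — an illustration of the format, not necessarily the example's exact implementation:

```javascript
// Minimal 16-bit PCM mono WAV writer (sketch). Returns an ArrayBuffer
// containing a 44-byte RIFF/WAVE header followed by the samples.
function writeWavFileSketch(samples, sampleRate) {
  const numSamples = samples.length;
  const buffer = new ArrayBuffer(44 + numSamples * 2);
  const view = new DataView(buffer);
  const writeString = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + numSamples * 2, true); // RIFF chunk size
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);                 // fmt chunk size
  view.setUint16(20, 1, true);                  // audio format: PCM
  view.setUint16(22, 1, true);                  // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true);     // byte rate
  view.setUint16(32, 2, true);                  // block align
  view.setUint16(34, 16, true);                 // bits per sample
  writeString(36, 'data');
  view.setUint32(40, numSamples * 2, true);     // data chunk size
  for (let i = 0; i < numSamples; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

All multi-byte fields are little-endian (the `true` flag on the `DataView` setters), as the WAV format requires.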
// Download handler (global so it can be called from inline onclick)
window.downloadAudio = function(url, filename) {
  const a = document.createElement('a');
  a.href = url;
  a.download = filename;
  a.click();
};

// Attach generate function to button
generateBtn.addEventListener('click', generateSpeech);

// Initialize on load
window.addEventListener('load', async () => {
  generateBtn.disabled = true;
  await initializeModels();
});

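One caveat about the object URLs above: the example keeps each `URL.createObjectURL` result alive so the `<audio>` player and download button stay usable, which means each generation retains its blob in memory. For one-shot uses, a scope-bound helper (a hypothetical addition, not part of this example) revokes the URL when the consumer is done:

```javascript
// Sketch: wrap a WAV buffer in a short-lived object URL and revoke it
// after use, so repeated generations don't accumulate blob memory.
function withObjectURL(wavBuffer, use) {
  const url = URL.createObjectURL(new Blob([wavBuffer], { type: 'audio/wav' }));
  try {
    return use(url);
  } finally {
    URL.revokeObjectURL(url); // frees the blob reference held by the URL
  }
}
```

An alternative that fits the example's UI is to store the previous generation's URL and call `URL.revokeObjectURL` on it before rendering a new result.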
21
web/package.json
Normal file
@@ -0,0 +1,21 @@
{
  "name": "tts-onnx-web",
  "version": "1.0.0",
  "description": "TTS inference using ONNX Runtime for Web Browser",
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "vite build",
    "preview": "vite preview"
  },
  "keywords": ["tts", "onnx", "speech-synthesis", "web"],
  "author": "",
  "license": "MIT",
  "dependencies": {
    "fft.js": "^4.0.3",
    "onnxruntime-web": "^1.17.0"
  },
  "devDependencies": {
    "vite": "^5.0.0"
  }
}

453
web/style.css
Normal file
@@ -0,0 +1,453 @@
* {
    margin: 0;
    padding: 0;
    box-sizing: border-box;
}

body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    min-height: 100vh;
    display: flex;
    justify-content: center;
    align-items: center;
    padding: 20px;
}

.container {
    background: white;
    border-radius: 20px;
    padding: 40px;
    max-width: 1400px;
    width: 100%;
    box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
}

.main-content {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 40px;
    margin-top: 30px;
    align-items: start;
}

.left-panel {
    display: flex;
    flex-direction: column;
}

.right-panel {
    display: flex;
    flex-direction: column;
    height: 100%;
}

@media (max-width: 1024px) {
    .main-content {
        grid-template-columns: 1fr;
    }
}

h1 {
    color: #333;
    margin-bottom: 10px;
    font-size: 2em;
}

.subtitle {
    color: #666;
    margin-bottom: 30px;
    font-size: 1.1em;
}

.section {
    margin-bottom: 25px;
}

label {
    display: block;
    font-weight: 600;
    color: #333;
    margin-bottom: 8px;
    font-size: 0.95em;
}

input[type="file"],
textarea,
input[type="number"] {
    width: 100%;
    padding: 12px;
    border: 2px solid #e0e0e0;
    border-radius: 8px;
    font-size: 1em;
    transition: border-color 0.3s;
}

input[type="file"]:focus,
textarea:focus,
input[type="number"]:focus {
    outline: none;
    border-color: #667eea;
}

textarea {
    resize: vertical;
    min-height: 100px;
    font-family: inherit;
}

.params-grid {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 15px;
}

button {
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    border: none;
    padding: 15px 30px;
    font-size: 1.1em;
    font-weight: 600;
    border-radius: 8px;
    cursor: pointer;
    width: 100%;
    transition: transform 0.2s, box-shadow 0.2s;
}

button:hover:not(:disabled) {
    transform: translateY(-2px);
    box-shadow: 0 5px 20px rgba(102, 126, 234, 0.4);
}

button:disabled {
    opacity: 0.6;
    cursor: not-allowed;
}

.status-box {
    background: #e3f2fd;
    border-left: 4px solid #2196f3;
    padding: 15px;
    margin-bottom: 10px;
    border-radius: 4px;
    font-size: 0.9em;
    color: #1565c0;
    transition: all 0.3s ease;
    display: flex;
    justify-content: space-between;
    align-items: center;
    flex-wrap: wrap;
    gap: 15px;
    min-height: 50px;
}

.status-box.success {
    background: #e8f5e9;
    border-left-color: #4caf50;
    color: #2e7d32;
}

.status-box.error {
    background: #ffebee;
    border-left-color: #f44336;
    color: #c62828;
}

.status-text-wrapper {
    flex: 1;
    min-width: 200px;
}

.backend-badge {
    display: inline-block;
    visibility: hidden;
    padding: 6px 12px;
    background: #ff9800;
    color: white;
    border-radius: 12px;
    font-size: 0.85em;
    font-weight: 600;
    margin-left: 10px;
    white-space: nowrap;
}

.backend-badge.visible {
    visibility: visible;
}

.ref-audio-info {
    color: #4caf50;
    font-weight: 700;
    font-size: 0.95em;
}

.ref-audio-label {
    margin-bottom: 8px;
}

.ref-audio-label label {
    display: inline;
    margin-bottom: 0;
}

.results {
    flex: 1;
    display: flex;
    flex-direction: column;
}

.result-item {
    background: white;
    border-radius: 16px;
    box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
    overflow: hidden;
    transition: box-shadow 0.3s ease;
    display: flex;
    flex-direction: column;
    flex: 1;
}

.result-item:hover {
    box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
}

.result-item h3 {
    color: #667eea;
    margin-bottom: 15px;
    font-size: 1.2em;
}

.result-text-container {
    padding: 20px;
    background: linear-gradient(135deg, #f8f9ff 0%, #ffffff 100%);
    border-bottom: 1px solid #e8ecf5;
    flex: 1;
    display: flex;
    flex-direction: column;
    overflow: hidden;
}

.result-text-label {
    font-size: 0.75em;
    text-transform: uppercase;
    letter-spacing: 0.5px;
    color: #667eea;
    font-weight: 600;
    margin-bottom: 8px;
}

.result-text {
    color: #333;
    line-height: 1.7;
    font-size: 0.95em;
    word-wrap: break-word;
    white-space: pre-wrap;
    overflow-y: auto;
    padding-right: 8px;
    flex: 1;
}

.result-text::-webkit-scrollbar {
    width: 6px;
}

.result-text::-webkit-scrollbar-track {
    background: #f0f0f0;
    border-radius: 3px;
}

.result-text::-webkit-scrollbar-thumb {
    background: #c0c0c0;
    border-radius: 3px;
}

.result-text::-webkit-scrollbar-thumb:hover {
    background: #a0a0a0;
}

.result-info {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 0;
    background: #fafbff;
}

.info-item {
    padding: 16px 20px;
    display: flex;
    align-items: center;
    gap: 8px;
    font-size: 0.9em;
    color: #666;
    border-bottom: 1px solid #e8ecf5;
}

.info-item:nth-child(1) {
    border-right: 1px solid #e8ecf5;
}

.info-item strong {
    color: #333;
    font-size: 1.1em;
    font-weight: 600;
    margin-left: auto;
}

.result-player {
    padding: 20px;
    background: white;
}

.result-item audio {
    width: 100%;
    height: 48px;
    outline: none;
}

.result-item audio:focus {
    outline: 2px solid #667eea;
    outline-offset: 2px;
    border-radius: 4px;
}

.result-actions {
    padding: 16px 20px 20px;
    background: white;
}

.result-item button {
    width: 100%;
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    border: none;
    padding: 12px 24px;
    font-size: 0.95em;
    font-weight: 600;
    border-radius: 8px;
    cursor: pointer;
    transition: all 0.3s ease;
    display: flex;
    align-items: center;
    justify-content: center;
    gap: 8px;
}

.result-item button:hover {
    transform: translateY(-2px);
    box-shadow: 0 4px 16px rgba(102, 126, 234, 0.3);
}

.result-item button:active {
    transform: translateY(0);
}

@media (max-width: 640px) {
    .result-info {
        grid-template-columns: 1fr;
    }

    .info-item:nth-child(1) {
        border-right: none;
    }
}

audio {
    width: 100%;
    margin-top: 10px;
}

.error {
    background: #fee;
    color: #c00;
    padding: 15px;
    border-radius: 8px;
    margin-top: 20px;
    display: none;
}

.error.active {
    display: block;
}

.warning-box {
    background: #fff3cd;
    color: #856404;
    padding: 12px 15px;
    border-radius: 8px;
    margin-top: 10px;
    border-left: 4px solid #ffc107;
    font-size: 0.9em;
    display: none;
    line-height: 1.5;
}

.warning-box.active {
    display: block;
}

.warning-box::before {
    content: "⚠️ ";
    margin-right: 5px;
}

.results-placeholder {
    background: white;
    border-radius: 16px;
    box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
    padding: 60px 40px;
    text-align: center;
    color: #999;
    transition: all 0.3s ease;
    display: flex;
    flex-direction: column;
    justify-content: center;
    align-items: center;
    flex: 1;
    min-height: 400px;
}

.results-placeholder:hover {
    box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
}

.results-placeholder-icon {
    font-size: 4em;
    margin-bottom: 20px;
    opacity: 0.6;
    animation: float 3s ease-in-out infinite;
}

.results-placeholder.generating .results-placeholder-icon {
    animation: spin 2s linear infinite;
}

@keyframes float {
    0%, 100% {
        transform: translateY(0px);
    }
    50% {
        transform: translateY(-10px);
    }
}

@keyframes spin {
    0% {
        transform: rotate(0deg);
    }
    100% {
        transform: rotate(360deg);
    }
}

.results-placeholder p {
    font-size: 1.05em;
    color: #888;
    font-weight: 500;
    margin: 0;
}

.hidden {
    display: none;
}

14
web/vite.config.js
Normal file
@@ -0,0 +1,14 @@
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    port: 3000,
    open: true
  },
  build: {
    target: 'esnext'
  },
  optimizeDeps: {
    exclude: ['onnxruntime-web']
  }
});