feat: add STT support, Gemini TTS, and expand usage tracking

- Speech-to-Text: full pipeline with sttCore handler, /v1/audio/transcriptions
  endpoint, sttConfig for OpenAI, Gemini, Groq, Deepgram, AssemblyAI,
  HuggingFace, NVIDIA Parakeet; new 9router-stt skill
- Gemini TTS: add gemini provider with 30 prebuilt voices and TTS_PROVIDER_CONFIG
- Usage: implement GLM (intl/cn) and MiniMax (intl/cn) quota fetchers; refactor
  Gemini CLI usage to use retrieveUserQuota with per-model buckets
- Disabled models: lowdb-backed disabledModelsDb + /api/models/disabled route
- Header search: reusable Zustand store (headerSearchStore) wired into Header
- CLI tools: add Claude Cowork tool card and cowork-settings API
- Providers: introduce mediaPriority sorting in getProvidersByKind, add
  Kimi K2.6, reorder hermes, drop qwen STT kind
- UI: expand media-providers/[kind]/[id] page (+314), enhance OAuthModal,
  ModelSelectModal, ProviderTopology, ProxyPools, ProviderLimits
- Assets: refresh provider PNGs (alicode, byteplus, cloudflare-ai, nvidia,
  ollama, vertex, volcengine-ark) and add aws-polly, fal-ai, jina-ai, recraft,
  runwayml, stability-ai, topaz, black-forest-labs
decolua
2026-05-05 10:32:59 +07:00
parent bfb7d42164
commit d4bc42e1f5
67 changed files with 2930 additions and 234 deletions


@@ -0,0 +1,77 @@
---
name: 9router-stt
description: Speech-to-text via 9Router /v1/audio/transcriptions using OpenAI Whisper / Groq / Gemini / Deepgram / AssemblyAI / NVIDIA / HuggingFace models. Use when the user wants to transcribe audio, convert speech to text, or get subtitles from audio files.
---
# 9Router — Speech-to-Text
Requires `NINEROUTER_URL` (and `NINEROUTER_KEY` if auth enabled). See https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router/SKILL.md for setup.
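A minimal shell setup sketch — the URL below is a placeholder, point it at your own deployment:
```bash
export NINEROUTER_URL="http://localhost:3000"  # placeholder — your 9Router base URL
export NINEROUTER_KEY="sk-..."                 # only needed when auth is enabled
```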
## Discover models
```bash
curl $NINEROUTER_URL/v1/models/stt | jq '.data[].id'
```
`model` = STT model ID (e.g. `openai/whisper-1`, `groq/whisper-large-v3`, `deepgram/nova-3`, `gemini/gemini-2.5-flash`).
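To narrow the list to a single provider, a jq filter sketch (assumes IDs keep the `provider/model` prefix shown above):
```bash
# List only Groq-backed STT models
curl -s "$NINEROUTER_URL/v1/models/stt" | jq -r '.data[].id | select(startswith("groq/"))'
```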
## Endpoint
`POST $NINEROUTER_URL/v1/audio/transcriptions` (OpenAI Whisper-compatible, `multipart/form-data`)
| Field | Required | Notes |
|---|---|---|
| `model` | yes | from `/v1/models/stt` |
| `file` | yes | audio file (mp3, wav, m4a, webm, ogg, flac) |
| `language` | no | ISO-639-1 (e.g. `en`, `vi`) |
| `prompt` | no | hint text to guide transcription |
| `response_format` | no | `json` (default) / `text` / `verbose_json` / `srt` / `vtt` |
| `temperature` | no | 0–1 |
## Examples
```bash
curl -X POST "$NINEROUTER_URL/v1/audio/transcriptions" \
-H "Authorization: Bearer $NINEROUTER_KEY" \
-F "model=openai/whisper-1" \
-F "file=@audio.mp3" \
-F "language=vi"
```
JS (Node 18+, ESM):
```js
import { readFile } from "node:fs/promises";

const form = new FormData();
form.append("model", "groq/whisper-large-v3-turbo");
// Wrap the raw bytes in a Blob so fetch sends proper multipart/form-data
form.append("file", new Blob([await readFile("audio.mp3")]), "audio.mp3");

const r = await fetch(`${process.env.NINEROUTER_URL}/v1/audio/transcriptions`, {
  method: "POST",
  headers: { "Authorization": `Bearer ${process.env.NINEROUTER_KEY}` },
  body: form,
});
if (!r.ok) throw new Error(`transcription failed: ${r.status}`);
const { text } = await r.json();
console.log(text);
```
## Response shape
Default (`response_format=json`):
```json
{ "text": "Xin chào, đây là bản ghi âm." }
```
`verbose_json` adds `language`, `duration`, `segments[]` with timestamps.
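An illustrative `verbose_json` body (trimmed; exact segment fields vary by provider):
```json
{
  "text": "Xin chào, đây là bản ghi âm.",
  "language": "vi",
  "duration": 3.1,
  "segments": [
    { "id": 0, "start": 0.0, "end": 3.1, "text": "Xin chào, đây là bản ghi âm." }
  ]
}
```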
`srt` / `vtt` return subtitle text.
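For subtitles, the same request with only `response_format` changed, writing SRT to a file:
```bash
curl -X POST "$NINEROUTER_URL/v1/audio/transcriptions" \
  -H "Authorization: Bearer $NINEROUTER_KEY" \
  -F "model=openai/whisper-1" \
  -F "file=@audio.mp3" \
  -F "response_format=srt" > audio.srt
```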
## Provider quirks
| Provider | `model` format | Notes |
|---|---|---|
| `openai` | `whisper-1`, `gpt-4o-transcribe`, `gpt-4o-mini-transcribe` | Native OpenAI shape |
| `groq` | `whisper-large-v3`, `whisper-large-v3-turbo`, `distil-whisper-large-v3-en` | Fastest; OpenAI shape |
| `gemini` | `gemini-2.5-flash`, `gemini-2.5-pro`, `gemini-2.5-flash-lite` | Server converts to `generateContent` with audio inline |
| `deepgram` | `nova-3`, `nova-2`, `whisper-large` | Token auth; server adapts response |
| `assemblyai` | `universal-3-pro`, `universal-2` | Async upload+poll handled server-side |
| `nvidia` | `nvidia/parakeet-ctc-1.1b-asr` | NIM endpoint |
| `huggingface` | `openai/whisper-large-v3`, `openai/whisper-small` | HF Inference API |
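Because every provider sits behind the same endpoint, switching is just a `model` change — a sketch comparing a few (assumes these models are enabled on your router):
```bash
# Transcribe the same file with three providers and print each result
for m in openai/whisper-1 groq/whisper-large-v3-turbo deepgram/nova-3; do
  echo "== $m =="
  curl -s -X POST "$NINEROUTER_URL/v1/audio/transcriptions" \
    -H "Authorization: Bearer $NINEROUTER_KEY" \
    -F "model=$m" -F "file=@audio.mp3" -F "response_format=text"
done
```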


@@ -49,6 +49,7 @@ When the user needs a specific capability, fetch that skill's `SKILL.md` from it
| Chat / code-gen | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-chat/SKILL.md |
| Image generation | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-image/SKILL.md |
| Text-to-speech | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-tts/SKILL.md |
| Speech-to-text | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-stt/SKILL.md |
| Embeddings | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-embeddings/SKILL.md |
| Web search | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-web-search/SKILL.md |
| Web fetch (URL → markdown) | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-web-fetch/SKILL.md |


@@ -12,6 +12,7 @@ Drop-in skills for any AI agent (Claude, Cursor, ChatGPT, custom SDK). Just **co
| Chat / code-gen | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-chat/SKILL.md |
| Image generation | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-image/SKILL.md |
| Text-to-speech | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-tts/SKILL.md |
| Speech-to-text | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-stt/SKILL.md |
| Embeddings | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-embeddings/SKILL.md |
| Web search | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-web-search/SKILL.md |
| Web fetch (URL → markdown) | https://raw.githubusercontent.com/decolua/9router/refs/heads/master/skills/9router-web-fetch/SKILL.md |