# Speech-to-Text POC
A speech-to-text proof of concept that processes audio locally using Vosk without requiring cloud APIs. The system exposes a WebSocket API that any client can connect to for real-time speech recognition.
## Features

- **Local Processing**: Uses Vosk for offline speech recognition
- **WebSocket API**: Server exposes `ws://localhost:3000` for any client to connect
- **Web Interface**: Browser-based demo for testing
- **Docker Support**: Complete containerized solution
- **No Cloud Dependencies**: Everything runs locally
## Quick Start

1. **Download the Vosk model:**

   ```bash
   curl -L -o vosk-model-small-en-us-0.15.zip https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
   unzip vosk-model-small-en-us-0.15.zip
   mv vosk-model-small-en-us-0.15 vosk-model
   ```

2. **Start with Docker:**

   ```bash
   docker-compose up --build
   ```

3. **Test the web interface:**

   - Open http://localhost:3000 in your browser
   - Click "Start Recording" and speak
   - See transcriptions appear in real-time
## WebSocket API Usage

The server exposes a WebSocket endpoint at `ws://localhost:3000` that accepts:

- **Input**: Raw WAV audio data (16kHz, 16-bit, mono)
- **Output**: JSON messages with transcriptions
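Before sending a file, a client can sanity-check that its WAV header matches the expected format. A minimal sketch — it assumes the canonical 44-byte RIFF header layout (fmt chunk immediately after "WAVE"); files with extra chunks would need a real chunk walker, and whether the server itself rejects mismatched audio is not specified here:

```javascript
// Check that a WAV buffer matches the 16 kHz, 16-bit, mono format the
// server expects. Assumes the canonical 44-byte RIFF/WAVE header layout.
function isExpectedWavFormat(buf) {
  if (buf.length < 44) return false;
  if (buf.toString('ascii', 0, 4) !== 'RIFF') return false;
  if (buf.toString('ascii', 8, 12) !== 'WAVE') return false;
  const channels = buf.readUInt16LE(22);      // 1 = mono
  const sampleRate = buf.readUInt32LE(24);    // expected 16000 Hz
  const bitsPerSample = buf.readUInt16LE(34); // expected 16
  return channels === 1 && sampleRate === 16000 && bitsPerSample === 16;
}
```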
### Example Client Usage

```javascript
const WebSocket = require('ws');
const fs = require('fs');

const ws = new WebSocket('ws://localhost:3000');

ws.on('open', () => {
  // Send a WAV audio file
  const audioData = fs.readFileSync('audio.wav');
  ws.send(audioData);
});

ws.on('message', (data) => {
  const message = JSON.parse(data);
  if (message.type === 'transcription') {
    console.log('Text:', message.text);
  }
});
```
See `client-example.js` for a complete Node.js client implementation.
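The example above sends a whole file at once; for live streaming, audio is typically sent in fixed-size chunks. A sketch of the chunking — the 3200-byte size (100 ms of 16 kHz, 16-bit mono audio) and the server's acceptance of raw PCM chunks are assumptions based on the "raw audio buffers" note below:

```javascript
// Split a raw PCM buffer into fixed-size chunks for streaming over the
// WebSocket. 3200 bytes = 100 ms of 16 kHz, 16-bit mono audio; the
// chunk size the server prefers is an assumption.
function chunkAudio(buf, chunkSize = 3200) {
  const chunks = [];
  for (let i = 0; i < buf.length; i += chunkSize) {
    chunks.push(buf.subarray(i, i + chunkSize));
  }
  return chunks;
}

// Usage with the client above:
//   for (const chunk of chunkAudio(audioData)) ws.send(chunk);
```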
## Local Development Setup

### Prerequisites

- Node.js 14+
- Python 3.8+
- Vosk model (downloaded as above)
### Installation

1. **Install Node.js dependencies:**

   ```bash
   yarn install
   ```

2. **Install Python dependencies:**

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. **Start the server:**

   ```bash
   yarn start
   ```
## Architecture

- **Backend**: Node.js Express server with WebSocket support
- **Speech Processing**: Python subprocess using the Vosk library
- **Frontend**: HTML5 + JavaScript with AudioWorklet for microphone capture
- **Communication**: WebSocket for bidirectional real-time communication
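One common way for the Node server to read results from a Python subprocess is line-delimited JSON on stdout. The sketch below shows that framing; it is an assumption about the `server.js` / `speech_processor.py` protocol, not a description of the actual implementation:

```javascript
// Accumulate stdout data from the speech subprocess and invoke a
// callback with one parsed JSON object per complete line.
// Line-delimited JSON framing is an assumed protocol.
function makeLineParser(onMessage) {
  let pending = '';
  return (data) => {
    pending += data.toString();
    const lines = pending.split('\n');
    pending = lines.pop(); // keep any incomplete trailing line
    for (const line of lines) {
      if (line.trim()) onMessage(JSON.parse(line));
    }
  };
}

// Hypothetical wiring in the server:
//   const proc = spawn('python3', ['speech_processor.py']);
//   proc.stdout.on('data', makeLineParser((msg) => ws.send(JSON.stringify(msg))));
```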
## Supported Audio Formats

- **Input**: WAV files (16kHz, 16-bit, mono preferred)
- **Browser**: Automatic conversion from microphone input
- **API**: Raw audio buffers or WAV format
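The browser-side "automatic conversion" boils down to turning the Float32 samples an AudioWorklet produces into 16-bit signed PCM. A minimal sketch of that step (resampling to 16 kHz is omitted, and the exact conversion in the bundled frontend may differ):

```javascript
// Convert Float32 samples in [-1, 1] (what Web Audio produces) into
// 16-bit signed little-endian PCM, the format the server expects.
function floatTo16BitPcm(float32Samples) {
  const buf = Buffer.alloc(float32Samples.length * 2);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp out-of-range samples, then scale to the int16 range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    buf.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return buf;
}
```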
## Performance Notes

- **Model Size**: Small model (~39MB) for fast loading
- **Latency**: Near real-time; depends on audio chunk size
- **Accuracy**: Good for clear speech; may degrade with background noise
- **Resource Usage**: Lightweight, suitable for local deployment
## Troubleshooting

### Common Issues

- **Model not found**: Ensure the Vosk model is extracted to the `./vosk-model/` directory
- **Python errors**: Check that the virtual environment is activated and dependencies are installed
- **WebSocket connection fails**: Verify the server is running on port 3000
- **No audio**: Check browser microphone permissions
### Docker Issues

- **Build failures**: Ensure you have enough disk space for the image
- **Model mounting**: Verify `./vosk-model/` exists before running docker-compose
- **Permission errors**: Check file permissions on the `vosk-model` directory
## Development

- **Server logs**: `docker-compose logs -f` to see real-time logs
- **Rebuild**: `docker-compose up --build` after code changes
- **Stop**: `docker-compose down` to stop all services
## Model Information

- **Current**: Vosk Small English US (0.15)
- **Size**: ~39MB
- **Languages**: English (US)
- **Accuracy**: Optimized for speed over accuracy
- **Alternatives**: See the [Vosk models page](https://alphacephei.com/vosk/models) for other languages and sizes