# Speech-to-Text POC
A speech-to-text proof of concept that processes audio locally using Vosk without requiring cloud APIs. The system exposes a WebSocket API that any client can connect to for real-time speech recognition.
## Features
- **Local Processing**: Uses Vosk for offline speech recognition
- **WebSocket API**: Server exposes `ws://localhost:3000` for any client to connect
- **Web Interface**: Browser-based demo for testing
- **Docker Support**: Complete containerized solution
- **No Cloud Dependencies**: Everything runs locally
## Quick Start
1. **Download Vosk model:**
```bash
curl -L -o vosk-model-small-en-us-0.15.zip https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
mv vosk-model-small-en-us-0.15 vosk-model
```
2. **Start with Docker:**
```bash
docker-compose up --build
```
3. **Test the web interface:**
- Open `http://localhost:3000` in your browser
- Click "Start Recording" and speak
- See transcriptions appear in real-time
## WebSocket API Usage
The server exposes a WebSocket endpoint at `ws://localhost:3000` that accepts:
- **Input**: Raw WAV audio data (16kHz, 16-bit, mono)
- **Output**: JSON messages with transcriptions
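For clients that generate audio programmatically rather than reading a file, the 44-byte header for the expected format (16kHz, 16-bit, mono PCM) can be built by hand. This is a minimal sketch of the standard WAV/RIFF header layout, not code from this repo:

```javascript
// Build a minimal 44-byte WAV header for 16kHz, 16-bit, mono PCM,
// to be prepended to a raw PCM buffer before sending over the WebSocket.
function wavHeader(dataLength) {
  const sampleRate = 16000, channels = 1, bitsPerSample = 16;
  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const h = Buffer.alloc(44);
  h.write('RIFF', 0);
  h.writeUInt32LE(36 + dataLength, 4); // total file size minus 8 bytes
  h.write('WAVE', 8);
  h.write('fmt ', 12);
  h.writeUInt32LE(16, 16);             // fmt chunk size (PCM)
  h.writeUInt16LE(1, 20);              // audio format 1 = PCM
  h.writeUInt16LE(channels, 22);
  h.writeUInt32LE(sampleRate, 24);
  h.writeUInt32LE(byteRate, 28);
  h.writeUInt16LE(blockAlign, 32);
  h.writeUInt16LE(bitsPerSample, 34);
  h.write('data', 36);
  h.writeUInt32LE(dataLength, 40);     // size of the PCM payload
  return h;
}
```

Concatenating this header with a raw PCM buffer (`Buffer.concat([wavHeader(pcm.length), pcm])`) yields a valid WAV payload for the endpoint.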
### Example Client Usage
```javascript
const WebSocket = require('ws');
const fs = require('fs');

const ws = new WebSocket('ws://localhost:3000');

ws.on('open', () => {
  // Send a WAV audio file for transcription
  const audioData = fs.readFileSync('audio.wav');
  ws.send(audioData);
});

ws.on('message', (data) => {
  const message = JSON.parse(data);
  if (message.type === 'transcription') {
    console.log('Text:', message.text);
  }
});
```
See `client-example.js` for a complete Node.js client implementation.
## Local Development Setup
### Prerequisites
- Node.js 14+
- Python 3.8+
- Vosk model (downloaded as above)
### Installation
1. **Install Node.js dependencies:**
```bash
yarn install
```
2. **Install Python dependencies:**
```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
3. **Start the server:**
```bash
yarn start
```
## Architecture
- **Backend**: Node.js Express server with WebSocket support
- **Speech Processing**: Python subprocess using Vosk library
- **Frontend**: HTML5 + JavaScript with AudioWorklet for microphone capture
- **Communication**: WebSocket for bidirectional real-time communication
## Supported Audio Formats
- **Input**: WAV files (16kHz, 16-bit, mono preferred)
- **Browser**: Automatic conversion from microphone input
- **API**: Raw audio buffers or WAV format
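The browser-side conversion mentioned above comes down to rescaling the Float32 samples in `[-1, 1]` that Web Audio produces into 16-bit signed PCM. A sketch of that conversion (not the exact code in this repo's frontend):

```javascript
// Convert Float32 samples in [-1, 1] (as produced by an AudioWorklet)
// to the 16-bit signed PCM that Vosk expects.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return pcm;
}
```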
## Performance Notes
- **Model Size**: Small model (~39MB) for fast loading
- **Latency**: Near real-time processing depending on audio chunk size
- **Accuracy**: Good for clear speech; degrades with background noise
- **Resource Usage**: Lightweight, suitable for local deployment
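The latency floor is set by chunk size: at 16kHz, 16-bit, mono, each second of audio is 32,000 bytes, so chunk duration follows directly from chunk size:

```javascript
// Bytes per second at 16kHz, 16-bit (2 bytes/sample), mono
const BYTES_PER_SECOND = 16000 * 2 * 1;

// Duration in milliseconds represented by a PCM chunk of a given size
function chunkDurationMs(chunkBytes) {
  return (chunkBytes / BYTES_PER_SECOND) * 1000;
}
```

For example, an 8,000-byte chunk carries 250 ms of audio, so no result can arrive sooner than that; smaller chunks reduce latency at the cost of more frequent recognizer calls.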
## Troubleshooting
### Common Issues
1. **Model not found**: Ensure Vosk model is extracted to `./vosk-model/` directory
2. **Python errors**: Check that the virtual environment is activated and dependencies are installed
3. **WebSocket connection fails**: Verify server is running on port 3000
4. **No audio**: Check browser microphone permissions
### Docker Issues
- **Build failures**: Ensure you have enough disk space for the image
- **Model mounting**: Verify `./vosk-model/` exists before running docker-compose
- **Permission errors**: Check file permissions on the vosk-model directory
## Development
- **Server logs**: `docker-compose logs -f` to see real-time logs
- **Rebuild**: `docker-compose up --build` after code changes
- **Stop**: `docker-compose down` to stop all services
## Model Information
- **Current**: Vosk Small English US (0.15)
- **Size**: ~39MB
- **Languages**: English (US)
- **Trade-off**: Optimized for speed over accuracy
- **Alternatives**: See [Vosk Models](https://alphacephei.com/vosk/models) for other languages/sizes