# Speech-to-Text POC

A speech-to-text proof of concept that processes audio locally using Vosk without requiring cloud APIs. The system exposes a WebSocket API that any client can connect to for real-time speech recognition.

## Features

- **Local Processing**: Uses Vosk for offline speech recognition
- **WebSocket API**: Server exposes `ws://localhost:3000` for any client to connect
- **Web Interface**: Browser-based demo for testing
- **Docker Support**: Complete containerized solution
- **No Cloud Dependencies**: Everything runs locally

## Quick Start

1. **Download Vosk model:**
   ```bash
   curl -L -o vosk-model-small-en-us-0.15.zip https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
   unzip vosk-model-small-en-us-0.15.zip
   mv vosk-model-small-en-us-0.15 vosk-model
   ```

2. **Start with Docker:**
   ```bash
   docker-compose up --build
   ```

3. **Test the web interface:**
   - Open `http://localhost:3000` in your browser
   - Click "Start Recording" and speak
   - See transcriptions appear in real-time

## WebSocket API Usage

The server exposes a WebSocket endpoint at `ws://localhost:3000` that accepts:

- **Input**: Raw WAV audio data (16kHz, 16-bit, mono)
- **Output**: JSON messages with transcriptions

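Because the endpoint expects 16kHz, 16-bit, mono audio, it can help to check a file before sending it. The following is a minimal sketch, not part of this project: `readWavFormat` and `matchesServerFormat` are hypothetical helper names, and the parser assumes a canonical 44-byte PCM header with the `fmt ` chunk at byte 12.

```javascript
// Read the format fields from a canonical 44-byte PCM WAV header.
// Assumption: the "fmt " chunk starts at byte 12, which holds for simple
// PCM files but not for files that place other chunks before "fmt ".
function readWavFormat(buf) {
  const tag = (start, end) => buf.toString('ascii', start, end);
  if (buf.length < 44 || tag(0, 4) !== 'RIFF' ||
      tag(8, 12) !== 'WAVE' || tag(12, 16) !== 'fmt ') {
    throw new Error('Not a canonical RIFF/WAVE file');
  }
  return {
    audioFormat: buf.readUInt16LE(20),   // 1 means uncompressed PCM
    channels: buf.readUInt16LE(22),
    sampleRate: buf.readUInt32LE(24),
    bitsPerSample: buf.readUInt16LE(34),
  };
}

// True when the header matches the 16kHz, 16-bit, mono input described above.
function matchesServerFormat(fmt) {
  return fmt.audioFormat === 1 && fmt.channels === 1 &&
         fmt.sampleRate === 16000 && fmt.bitsPerSample === 16;
}
```

A client could call `matchesServerFormat(readWavFormat(fs.readFileSync('audio.wav')))` and convert the file first if the result is `false`.
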
### Example Client Usage

```javascript
const WebSocket = require('ws');
const fs = require('fs');

const ws = new WebSocket('ws://localhost:3000');

ws.on('open', () => {
  // Send a WAV audio file (16kHz, 16-bit, mono)
  const audioData = fs.readFileSync('audio.wav');
  ws.send(audioData);
});

ws.on('message', (data) => {
  // Messages arrive as JSON; convert to a string before parsing
  const message = JSON.parse(data.toString());
  if (message.type === 'transcription') {
    console.log('Text:', message.text);
  }
});

// Report connection failures (for example, the server is not running)
ws.on('error', (err) => {
  console.error('WebSocket error:', err.message);
});
```

See `client-example.js` for a complete Node.js client implementation.

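The `message` handler in the example above throws if a frame is not valid JSON. A slightly more defensive variant is sketched below. `extractTranscription` is a hypothetical helper name, and the only message shape assumed is the `{ type: 'transcription', text }` form used in the example; any other message types the server may send are simply ignored.

```javascript
// Return the transcribed text from a server message, or null when the
// message is not valid JSON or is not a transcription message.
// Accepts a string or a Buffer (both expose toString()).
function extractTranscription(data) {
  let message;
  try {
    message = JSON.parse(data.toString());
  } catch {
    return null; // malformed frame: ignore instead of crashing the client
  }
  if (message && message.type === 'transcription' && typeof message.text === 'string') {
    return message.text;
  }
  return null;
}
```

Inside the client, the handler body then reduces to `const text = extractTranscription(data); if (text !== null) console.log('Text:', text);`.
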
## Local Development Setup

### Prerequisites

- Node.js 14+
- Python 3.8+
- Vosk model (downloaded as above)

### Installation

1. **Install Node.js dependencies:**
   ```bash
   yarn install
   ```

2. **Install Python dependencies:**
   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. **Start the server:**
   ```bash
   yarn start
   ```

## Architecture

- **Backend**: Node.js Express server with WebSocket support
- **Speech Processing**: Python subprocess using Vosk library
- **Frontend**: HTML5 + JavaScript with AudioWorklet for microphone capture
- **Communication**: WebSocket for bidirectional real-time communication

## Supported Audio Formats

- **Input**: WAV files (16kHz, 16-bit, mono preferred)
- **Browser**: Automatic conversion from microphone input
- **API**: Raw audio buffers or WAV format

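Microphone capture typically yields 32-bit float samples in the range -1 to 1, so some conversion to 16-bit PCM is needed before the audio matches the formats above. The sketch below shows one common way to do that conversion and to wrap the result in a WAV header. It is illustrative only and is not taken from this project's frontend: `floatTo16BitPCM` and `encodeWav` are hypothetical names, the code uses Node.js `Buffer` rather than browser APIs, and it assumes the samples are already mono at 16kHz.

```javascript
// Convert float samples in [-1, 1] to signed 16-bit PCM, clamping
// out-of-range values. Fractions are truncated by the Int16Array store.
function floatTo16BitPCM(samples) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// Wrap mono 16-bit PCM samples in a canonical 44-byte WAV header.
function encodeWav(pcm, sampleRate = 16000) {
  const dataBytes = pcm.length * 2;
  const buf = Buffer.alloc(44 + dataBytes);
  buf.write('RIFF', 0, 'ascii');
  buf.writeUInt32LE(36 + dataBytes, 4);   // file size minus 8 bytes
  buf.write('WAVE', 8, 'ascii');
  buf.write('fmt ', 12, 'ascii');
  buf.writeUInt32LE(16, 16);              // size of the fmt chunk
  buf.writeUInt16LE(1, 20);               // audio format: PCM
  buf.writeUInt16LE(1, 22);               // channels: mono
  buf.writeUInt32LE(sampleRate, 24);
  buf.writeUInt32LE(sampleRate * 2, 28);  // byte rate (2 bytes per sample)
  buf.writeUInt16LE(2, 32);               // block align
  buf.writeUInt16LE(16, 34);              // bits per sample
  buf.write('data', 36, 'ascii');
  buf.writeUInt32LE(dataBytes, 40);
  for (let i = 0; i < pcm.length; i++) {
    buf.writeInt16LE(pcm[i], 44 + i * 2);
  }
  return buf;
}
```

Audio captured at a different rate (browsers often use 44.1kHz or 48kHz) would need resampling first, which this sketch does not cover.
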
## Performance Notes

- **Model Size**: Small model (~39MB) for fast loading
- **Latency**: Near real-time processing depending on audio chunk size
- **Accuracy**: Good for clear speech, may vary with background noise
- **Resource Usage**: Lightweight, suitable for local deployment

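To make the link between chunk size and latency concrete: at 16kHz, 16-bit, mono, one second of audio is 32,000 bytes, so a 250 ms chunk is 8,000 bytes. The sketch below does that arithmetic and splits a PCM buffer accordingly. `chunkBytes` and `splitPcm` are hypothetical names, and whether the server accepts audio split across several messages is an assumption that should be checked against the server code.

```javascript
const SAMPLE_RATE = 16000;   // samples per second
const BYTES_PER_SAMPLE = 2;  // 16-bit mono

// Number of bytes that hold `ms` milliseconds of 16kHz, 16-bit, mono audio.
function chunkBytes(ms) {
  return Math.floor((SAMPLE_RATE * ms) / 1000) * BYTES_PER_SAMPLE;
}

// Split a buffer of raw PCM bytes into chunks of `ms` milliseconds each.
// The last chunk may be shorter than the others.
function splitPcm(pcm, ms) {
  const size = chunkBytes(ms);
  if (size <= 0) throw new RangeError('chunk duration too small');
  const chunks = [];
  for (let offset = 0; offset < pcm.length; offset += size) {
    chunks.push(pcm.subarray(offset, offset + size));
  }
  return chunks;
}
```

Smaller chunks give the recognizer audio sooner at the cost of more messages; larger chunks do the opposite.
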
## Troubleshooting

### Common Issues

1. **Model not found**: Ensure the Vosk model is extracted to the `./vosk-model/` directory
2. **Python errors**: Check that the virtual environment is activated and the dependencies are installed
3. **WebSocket connection fails**: Verify that the server is running on port 3000
4. **No audio**: Check browser microphone permissions

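The "model not found" case can be caught before the server starts with a small preflight check. This is a sketch only: `checkModelDir` is a hypothetical helper, and it checks just that the directory exists and is not empty, without validating the model's contents.

```javascript
const fs = require('fs');

// Return null when `dir` looks usable, or a short description of the problem.
// Only existence and non-emptiness are checked; the files inside the
// model directory are not inspected.
function checkModelDir(dir = './vosk-model') {
  if (!fs.existsSync(dir)) {
    return `Model directory not found: ${dir}`;
  }
  if (!fs.statSync(dir).isDirectory()) {
    return `Not a directory: ${dir}`;
  }
  if (fs.readdirSync(dir).length === 0) {
    return `Model directory is empty: ${dir}`;
  }
  return null;
}
```

A startup script could call `checkModelDir()` and exit with the returned message when it is not `null`.
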
### Docker Issues

- **Build failures**: Ensure you have enough disk space for the image
- **Model mounting**: Verify that `./vosk-model/` exists before running `docker-compose`
- **Permission errors**: Check file permissions on the `vosk-model` directory

## Development

- **Server logs**: `docker-compose logs -f` to see real-time logs
- **Rebuild**: `docker-compose up --build` after code changes
- **Stop**: `docker-compose down` to stop all services

## Model Information

- **Current**: Vosk Small English US (0.15)
- **Size**: ~39MB
- **Languages**: English (US)
- **Accuracy**: Optimized for speed over accuracy
- **Alternatives**: See [Vosk Models](https://alphacephei.com/vosk/models) for other languages/sizes