# Speech-to-Text POC

A speech-to-text proof of concept that processes audio locally using Vosk without requiring cloud APIs. The system exposes a WebSocket API that any client can connect to for real-time speech recognition.

## Features

- **Local Processing**: Uses Vosk for offline speech recognition
- **WebSocket API**: Server exposes `ws://localhost:3000` for any client to connect
- **Web Interface**: Browser-based demo for testing
- **Docker Support**: Complete containerized solution
- **No Cloud Dependencies**: Everything runs locally

## Quick Start

1. **Download Vosk model:**

   ```bash
   curl -L -o vosk-model-small-en-us-0.15.zip https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
   unzip vosk-model-small-en-us-0.15.zip
   mv vosk-model-small-en-us-0.15 vosk-model
   ```

2. **Start with Docker:**

   ```bash
   docker-compose up --build
   ```

3. **Test the web interface:**
   - Open `http://localhost:3000` in your browser
   - Click "Start Recording" and speak
   - See transcriptions appear in real-time

## WebSocket API Usage

The server exposes a WebSocket endpoint at `ws://localhost:3000` that accepts:

- **Input**: Raw WAV audio data (16kHz, 16-bit, mono)
- **Output**: JSON messages with transcriptions

### Example Client Usage

```javascript
const WebSocket = require('ws');
const fs = require('fs');

const ws = new WebSocket('ws://localhost:3000');

ws.on('open', () => {
  // Send WAV audio file
  const audioData = fs.readFileSync('audio.wav');
  ws.send(audioData);
});

ws.on('message', (data) => {
  const message = JSON.parse(data);
  if (message.type === 'transcription') {
    console.log('Text:', message.text);
  }
});
```

See `client-example.js` for a complete Node.js client implementation.

## Local Development Setup

### Prerequisites

- Node.js 14+
- Python 3.8+
- Vosk model (downloaded as above)

### Installation

1. **Install Node.js dependencies:**

   ```bash
   yarn install
   ```

2. **Install Python dependencies:**

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. **Start the server:**

   ```bash
   yarn start
   ```

## Architecture

- **Backend**: Node.js Express server with WebSocket support
- **Speech Processing**: Python subprocess using the Vosk library
- **Frontend**: HTML5 + JavaScript with AudioWorklet for microphone capture
- **Communication**: WebSocket for bidirectional real-time communication

## Supported Audio Formats

- **Input**: WAV files (16kHz, 16-bit, mono preferred)
- **Browser**: Automatic conversion from microphone input
- **API**: Raw audio buffers or WAV format

## Performance Notes

- **Model Size**: Small model (~39MB) for fast loading
- **Latency**: Near real-time processing depending on audio chunk size
- **Accuracy**: Good for clear speech, may vary with background noise
- **Resource Usage**: Lightweight, suitable for local deployment

## Troubleshooting

### Common Issues

1. **Model not found**: Ensure the Vosk model is extracted to the `./vosk-model/` directory
2. **Python errors**: Check that the virtual environment is activated and dependencies are installed
3. **WebSocket connection fails**: Verify the server is running on port 3000
4. **No audio**: Check browser microphone permissions
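For issues 1 and 3 above, a quick shell check can confirm the model directory is in place and the server is reachable before digging deeper. This is only a sketch; it assumes the server is running locally on the default port 3000 described in the Quick Start.

```bash
# Confirm the Vosk model was extracted to the location the server expects
[ -d ./vosk-model ] && echo "vosk-model directory found" || echo "vosk-model directory missing"

# Confirm something is answering on port 3000 (the server also serves the web interface over HTTP)
curl -sf -o /dev/null http://localhost:3000 && echo "server reachable" || echo "server not reachable on port 3000"
```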
### Docker Issues

- **Build failures**: Ensure you have enough disk space for the image
- **Model mounting**: Verify `./vosk-model/` exists before running docker-compose
- **Permission errors**: Check file permissions on the vosk-model directory

## Development

- **Server logs**: `docker-compose logs -f` to see real-time logs
- **Rebuild**: `docker-compose up --build` after code changes
- **Stop**: `docker-compose down` to stop all services

## Model Information

- **Current**: Vosk Small English US (0.15)
- **Size**: ~39MB
- **Languages**: English (US)
- **Accuracy**: Optimized for speed over accuracy
- **Alternatives**: See [Vosk Models](https://alphacephei.com/vosk/models) for other languages/sizes
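Switching to another model follows the same pattern as the Quick Start download step. The sketch below is only an outline: `<model-name>` is a placeholder for an archive name taken from the Vosk Models page, and it assumes the server simply loads whatever model sits in `./vosk-model/`.

```bash
# Placeholder example: replace <model-name> with an archive listed at https://alphacephei.com/vosk/models
curl -L -o model.zip https://alphacephei.com/vosk/models/<model-name>.zip
unzip model.zip

# Replace the directory the server reads the model from
rm -rf vosk-model
mv <model-name> vosk-model

# Restart the stack so the server picks up the new model
docker-compose up --build
```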