# Speech-to-Text POC
A speech-to-text proof of concept that processes audio locally using Vosk without requiring cloud APIs. The system exposes a WebSocket API that any client can connect to for real-time speech recognition.
## Features
- **Local Processing**: Uses Vosk for offline speech recognition
- **WebSocket API**: Server exposes `ws://localhost:3000` for any client to connect
- **Web Interface**: Browser-based demo for testing
- **Docker Support**: Complete containerized solution
- **No Cloud Dependencies**: Everything runs locally
## Quick Start
1. **Download Vosk model:**
```bash
curl -L -o vosk-model-small-en-us-0.15.zip https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
mv vosk-model-small-en-us-0.15 vosk-model
```
2. **Start with Docker:**
```bash
docker-compose up --build
```
3. **Test the web interface:**
- Open `http://localhost:3000` in your browser
- Click "Start Recording" and speak
- See transcriptions appear in real-time
## WebSocket API Usage
The server exposes a WebSocket endpoint at `ws://localhost:3000` that accepts:
- **Input**: Raw WAV audio data (16kHz, 16-bit, mono)
- **Output**: JSON messages with transcriptions
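For clients that generate audio programmatically rather than reading a file, the 44-byte header for the expected format (16kHz, 16-bit, mono PCM) can be built by hand. This is a minimal sketch of the standard WAV/RIFF header layout, not code from this repo:

```javascript
// Build a minimal 44-byte WAV header for 16kHz, 16-bit, mono PCM,
// to be prepended to a raw PCM buffer before sending over the WebSocket.
function wavHeader(dataLength) {
  const sampleRate = 16000, channels = 1, bitsPerSample = 16;
  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const h = Buffer.alloc(44);
  h.write('RIFF', 0);
  h.writeUInt32LE(36 + dataLength, 4); // total file size minus 8 bytes
  h.write('WAVE', 8);
  h.write('fmt ', 12);
  h.writeUInt32LE(16, 16);             // fmt chunk size (PCM)
  h.writeUInt16LE(1, 20);              // audio format 1 = PCM
  h.writeUInt16LE(channels, 22);
  h.writeUInt32LE(sampleRate, 24);
  h.writeUInt32LE(byteRate, 28);
  h.writeUInt16LE(blockAlign, 32);
  h.writeUInt16LE(bitsPerSample, 34);
  h.write('data', 36);
  h.writeUInt32LE(dataLength, 40);     // size of the PCM payload
  return h;
}
```

Concatenating this header with a raw PCM buffer (`Buffer.concat([wavHeader(pcm.length), pcm])`) yields a valid WAV payload for the endpoint.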
### Example Client Usage
```javascript
const WebSocket = require('ws');
const fs = require('fs');

const ws = new WebSocket('ws://localhost:3000');

ws.on('open', () => {
  // Send a WAV audio file for transcription
  const audioData = fs.readFileSync('audio.wav');
  ws.send(audioData);
});

ws.on('message', (data) => {
  const message = JSON.parse(data);
  if (message.type === 'transcription') {
    console.log('Text:', message.text);
  }
});
```
See `client-example.js` for a complete Node.js client implementation.
## Local Development Setup
### Prerequisites
- Node.js 14+
- Python 3.8+
- Vosk model (downloaded as above)
### Installation
1. **Install Node.js dependencies:**
```bash
yarn install
```
2. **Install Python dependencies:**
```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
3. **Start the server:**
```bash
yarn start
```
## Architecture
- **Backend**: Node.js Express server with WebSocket support
- **Speech Processing**: Python subprocess using Vosk library
- **Frontend**: HTML5 + JavaScript with AudioWorklet for microphone capture
- **Communication**: WebSocket for bidirectional real-time communication
## Supported Audio Formats
- **Input**: WAV files (16kHz, 16-bit, mono preferred)
- **Browser**: Automatic conversion from microphone input
- **API**: Raw audio buffers or WAV format
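The browser-side conversion mentioned above comes down to rescaling the Float32 samples in `[-1, 1]` that Web Audio produces into 16-bit signed PCM. A sketch of that conversion (not the exact code in this repo's frontend):

```javascript
// Convert Float32 samples in [-1, 1] (as produced by an AudioWorklet)
// to the 16-bit signed PCM that Vosk expects.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return pcm;
}
```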
## Performance Notes
- **Model Size**: Small model (~39MB) for fast loading
- **Latency**: Near real-time processing depending on audio chunk size
- **Accuracy**: Good for clear speech; degrades with background noise
- **Resource Usage**: Lightweight, suitable for local deployment
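The latency floor is set by chunk size: at 16kHz, 16-bit, mono, each second of audio is 32,000 bytes, so chunk duration follows directly from chunk size:

```javascript
// Bytes per second at 16kHz, 16-bit (2 bytes/sample), mono
const BYTES_PER_SECOND = 16000 * 2 * 1;

// Duration in milliseconds represented by a PCM chunk of a given size
function chunkDurationMs(chunkBytes) {
  return (chunkBytes / BYTES_PER_SECOND) * 1000;
}
```

For example, an 8,000-byte chunk carries 250 ms of audio, so no result can arrive sooner than that; smaller chunks reduce latency at the cost of more frequent recognizer calls.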
## Troubleshooting
### Common Issues
1. **Model not found**: Ensure Vosk model is extracted to `./vosk-model/` directory
2. **Python errors**: Check that the virtual environment is activated and dependencies are installed
3. **WebSocket connection fails**: Verify server is running on port 3000
4. **No audio**: Check browser microphone permissions
### Docker Issues
- **Build failures**: Ensure you have enough disk space for the image
- **Model mounting**: Verify `./vosk-model/` exists before running docker-compose
- **Permission errors**: Check file permissions on the vosk-model directory
## Development
- **Server logs**: `docker-compose logs -f` to see real-time logs
- **Rebuild**: `docker-compose up --build` after code changes
- **Stop**: `docker-compose down` to stop all services
## Model Information
- **Current**: Vosk Small English US (0.15)
- **Size**: ~39MB
- **Languages**: English (US)
- **Trade-off**: Optimized for speed over accuracy
- **Alternatives**: See [Vosk Models](https://alphacephei.com/vosk/models) for other languages/sizes