init

2025-06-05 14:31:34 +05:30
commit 9989eeb879
16 changed files with 1015 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,136 @@
+# Speech-to-Text POC
+
+A speech-to-text proof of concept that processes audio locally using Vosk without requiring cloud APIs. The system exposes a WebSocket API that any client can connect to for real-time speech recognition.
+
+## Features
+
+- **Local Processing**: Uses Vosk for offline speech recognition
+- **WebSocket API**: Server exposes `ws://localhost:3000` for any client to connect
+- **Web Interface**: Browser-based demo for testing
+- **Docker Support**: Complete containerized solution
+- **No Cloud Dependencies**: Everything runs locally
+
+## Quick Start
+
+1. **Download Vosk model:**
+   ```bash
+   curl -L -o vosk-model-small-en-us-0.15.zip https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
+   unzip vosk-model-small-en-us-0.15.zip
+   mv vosk-model-small-en-us-0.15 vosk-model
+   ```
+
+2. **Start with Docker:**
+   ```bash
+   docker-compose up --build
+   ```
+
+3. **Test the web interface:**
+   - Open `http://localhost:3000` in your browser
+   - Click "Start Recording" and speak
+   - Watch transcriptions appear in real time
+
+## WebSocket API Usage
+
+The server exposes a WebSocket endpoint at `ws://localhost:3000` that accepts:
+
+- **Input**: Binary audio data, either raw buffers or WAV (16 kHz, 16-bit, mono)
+- **Output**: JSON messages with transcriptions
+
+### Example Client Usage
+
+```javascript
+const WebSocket = require('ws');
+const fs = require('fs');
+
+const ws = new WebSocket('ws://localhost:3000');
+
+ws.on('open', () => {
+    // Send WAV audio file
+    const audioData = fs.readFileSync('audio.wav');
+    ws.send(audioData);
+});
+
+ws.on('message', (data) => {
+    // Messages are JSON; `data` may arrive as a Buffer, so convert it to a string first
+    const message = JSON.parse(data.toString());
+    if (message.type === 'transcription') {
+        console.log('Text:', message.text);
+    }
+});
+```
+
+See `client-example.js` for a complete Node.js client implementation.
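+
+A more defensive message handler could be sketched as follows. It relies only on the message shape shown above (a `type` field and a `text` field). The helper name `extractTranscription` is hypothetical and is not part of this project.
+
+```javascript
+// Hypothetical helper: returns the transcription text, or null for any other message.
+function extractTranscription(raw) {
+    let message;
+    try {
+        // `raw` may be a Buffer or a string, depending on the client library
+        message = JSON.parse(raw.toString());
+    } catch (err) {
+        return null; // not valid JSON
+    }
+    if (message && message.type === 'transcription' && typeof message.text === 'string') {
+        return message.text;
+    }
+    return null;
+}
+```
+
+It could be called from the handler as `const text = extractTranscription(data); if (text !== null) console.log('Text:', text);`.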
+
+## Local Development Setup
+
+### Prerequisites
+- Node.js 14+
+- Python 3.8+
+- Vosk model (downloaded as above)
+
+### Installation
+
+1. **Install Node.js dependencies:**
+   ```bash
+   yarn install
+   ```
+
+2. **Install Python dependencies:**
+   ```bash
+   python3 -m venv venv
+   source venv/bin/activate  # On Windows: venv\Scripts\activate
+   pip install -r requirements.txt
+   ```
+
+3. **Start the server:**
+   ```bash
+   yarn start
+   ```
+
+## Architecture
+
+- **Backend**: Node.js Express server with WebSocket support
+- **Speech Processing**: Python subprocess using Vosk library
+- **Frontend**: HTML5 + JavaScript with AudioWorklet for microphone capture
+- **Communication**: WebSocket for bidirectional real-time communication
+
+## Supported Audio Formats
+
+- **Input**: WAV files (16kHz, 16-bit, mono preferred)
+- **Browser**: Automatic conversion from microphone input
+- **API**: Raw audio buffers or WAV format
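+
+To check a file against the preferred format before sending it, a client could inspect the WAV header. The sketch below assumes a canonical PCM WAV layout in which the `fmt ` chunk directly follows the `RIFF`/`WAVE` identifiers; files with extra chunks before `fmt ` would need a fuller parser. The function names are illustrative and not part of this project.
+
+```javascript
+// Reads the format fields of a canonical PCM WAV header.
+// Returns null if the buffer does not look like such a header.
+function readWavFormat(buffer) {
+    if (buffer.length < 44 ||
+        buffer.toString('ascii', 0, 4) !== 'RIFF' ||
+        buffer.toString('ascii', 8, 12) !== 'WAVE' ||
+        buffer.toString('ascii', 12, 16) !== 'fmt ') {
+        return null;
+    }
+    return {
+        channels: buffer.readUInt16LE(22),      // 1 = mono
+        sampleRate: buffer.readUInt32LE(24),    // samples per second
+        bitsPerSample: buffer.readUInt16LE(34), // sample width in bits
+    };
+}
+
+// True only for 16 kHz, 16-bit, mono audio.
+function isPreferredFormat(buffer) {
+    const fmt = readWavFormat(buffer);
+    return fmt !== null &&
+        fmt.channels === 1 &&
+        fmt.sampleRate === 16000 &&
+        fmt.bitsPerSample === 16;
+}
+```
+
+For example, `isPreferredFormat(fs.readFileSync('audio.wav'))` could gate the `ws.send` call in the client example.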
+
+## Performance Notes
+
+- **Model Size**: Small model (~39MB) for fast loading
+- **Latency**: Near real-time; latency depends on the audio chunk size
+- **Accuracy**: Good for clear speech; may degrade with background noise
+- **Resource Usage**: Lightweight, suitable for local deployment
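+
+The link between chunk size and latency can be estimated with simple arithmetic: 16 kHz, 16-bit, mono audio is 16000 × 2 × 1 = 32000 bytes per second. The sketch below covers only the time spent buffering audio, not recognition time, and the function names are illustrative.
+
+```javascript
+// Duration of audio (in milliseconds) held in a chunk of `bytes` bytes of PCM data.
+function chunkDurationMs(bytes, sampleRate = 16000, bitsPerSample = 16, channels = 1) {
+    const bytesPerSecond = sampleRate * (bitsPerSample / 8) * channels;
+    return (bytes / bytesPerSecond) * 1000;
+}
+
+// Chunk size (in bytes) needed to hold `ms` milliseconds of PCM audio.
+function chunkBytesFor(ms, sampleRate = 16000, bitsPerSample = 16, channels = 1) {
+    const bytesPerSecond = sampleRate * (bitsPerSample / 8) * channels;
+    return Math.round((ms / 1000) * bytesPerSecond);
+}
+
+console.log(chunkDurationMs(4096)); // 128
+console.log(chunkBytesFor(250));    // 8000
+```
+
+Smaller chunks reduce the buffering delay but mean more messages per second over the WebSocket.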
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Model not found**: Ensure the Vosk model is extracted to the `./vosk-model/` directory
+2. **Python errors**: Check that the virtual environment is activated and the dependencies are installed
+3. **WebSocket connection fails**: Verify the server is running on port 3000
+4. **No audio**: Check browser microphone permissions
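+
+The first and third checks can be automated with a small preflight script. This is a sketch using only Node.js built-ins; the function names are hypothetical, and the path and port are the defaults mentioned in this README.
+
+```javascript
+const fs = require('fs');
+const net = require('net');
+
+// True if `dir` exists and is a directory (for example './vosk-model').
+function modelDirExists(dir) {
+    try {
+        return fs.statSync(dir).isDirectory();
+    } catch (err) {
+        return false; // missing path or no permission to read it
+    }
+}
+
+// Resolves to true if something accepts TCP connections on host:port.
+function portIsOpen(port, host = '127.0.0.1', timeoutMs = 1000) {
+    return new Promise((resolve) => {
+        const socket = net.connect({ port, host });
+        const finish = (ok) => { socket.destroy(); resolve(ok); };
+        socket.setTimeout(timeoutMs, () => finish(false));
+        socket.once('connect', () => finish(true));
+        socket.once('error', () => finish(false));
+    });
+}
+```
+
+For example: `console.log(modelDirExists('./vosk-model'));` and `portIsOpen(3000).then((ok) => console.log('server reachable:', ok));`.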
+
+### Docker Issues
+
+- **Build failures**: Ensure you have enough disk space for the image
+- **Model mounting**: Verify `./vosk-model/` exists before running `docker-compose up`
+- **Permission errors**: Check file permissions on the vosk-model directory
+
+## Development
+
+- **Server logs**: `docker-compose logs -f` to see real-time logs
+- **Rebuild**: `docker-compose up --build` after code changes
+- **Stop**: `docker-compose down` to stop all services
+
+## Model Information
+
+- **Current**: Vosk Small English US (0.15)
+- **Size**: ~39MB
+- **Languages**: English (US)
+- **Accuracy**: Optimized for speed over accuracy
+- **Alternatives**: See [Vosk Models](https://alphacephei.com/vosk/models) for other languages/sizes