# Speech-to-Text POC
A speech-to-text proof of concept that processes audio locally using Vosk without requiring cloud APIs. The system exposes a WebSocket API that any client can connect to for real-time speech recognition.
## Features

- **Local Processing**: Uses Vosk for offline speech recognition
- **WebSocket API**: Server exposes `ws://localhost:3000` for any client to connect
- **Web Interface**: Browser-based demo for testing
- **Docker Support**: Complete containerized solution
- **No Cloud Dependencies**: Everything runs locally
## Quick Start

1. **Download the Vosk model:**

   ```bash
   curl -L -o vosk-model-small-en-us-0.15.zip https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
   unzip vosk-model-small-en-us-0.15.zip
   mv vosk-model-small-en-us-0.15 vosk-model
   ```

2. **Start with Docker:**

   ```bash
   docker-compose up --build
   ```

3. **Test the web interface:**

   - Open http://localhost:3000 in your browser
   - Click "Start Recording" and speak
   - See transcriptions appear in real-time
## WebSocket API Usage

The server exposes a WebSocket endpoint at `ws://localhost:3000` that accepts:

- **Input**: Raw WAV audio data (16kHz, 16-bit, mono)
- **Output**: JSON messages with transcriptions
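Before sending a file, a client can sanity-check that its WAV header matches the expected format. A minimal sketch — it assumes the canonical 44-byte RIFF header layout (fmt chunk immediately after "WAVE"); files with extra chunks would need a real chunk walker, and whether the server itself rejects mismatched audio is not specified here:

```javascript
// Check that a WAV buffer matches the 16 kHz, 16-bit, mono format the
// server expects. Assumes the canonical 44-byte RIFF/WAVE header layout.
function isExpectedWavFormat(buf) {
  if (buf.length < 44) return false;
  if (buf.toString('ascii', 0, 4) !== 'RIFF') return false;
  if (buf.toString('ascii', 8, 12) !== 'WAVE') return false;
  const channels = buf.readUInt16LE(22);      // 1 = mono
  const sampleRate = buf.readUInt32LE(24);    // expected 16000 Hz
  const bitsPerSample = buf.readUInt16LE(34); // expected 16
  return channels === 1 && sampleRate === 16000 && bitsPerSample === 16;
}
```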
### Example Client Usage

```javascript
const WebSocket = require('ws');
const fs = require('fs');

const ws = new WebSocket('ws://localhost:3000');

ws.on('open', () => {
  // Send a WAV audio file
  const audioData = fs.readFileSync('audio.wav');
  ws.send(audioData);
});

ws.on('message', (data) => {
  const message = JSON.parse(data);
  if (message.type === 'transcription') {
    console.log('Text:', message.text);
  }
});
```
See `client-example.js` for a complete Node.js client implementation.
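The example above sends a whole file at once; for live streaming, audio is typically sent in fixed-size chunks. A sketch of the chunking — the 3200-byte size (100 ms of 16 kHz, 16-bit mono audio) and the server's acceptance of raw PCM chunks are assumptions based on the "raw audio buffers" note below:

```javascript
// Split a raw PCM buffer into fixed-size chunks for streaming over the
// WebSocket. 3200 bytes = 100 ms of 16 kHz, 16-bit mono audio; the
// chunk size the server prefers is an assumption.
function chunkAudio(buf, chunkSize = 3200) {
  const chunks = [];
  for (let i = 0; i < buf.length; i += chunkSize) {
    chunks.push(buf.subarray(i, i + chunkSize));
  }
  return chunks;
}

// Usage with the client above:
//   for (const chunk of chunkAudio(audioData)) ws.send(chunk);
```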
## Local Development Setup

### Prerequisites

- Node.js 14+
- Python 3.8+
- Vosk model (downloaded as above)
### Installation

1. **Install Node.js dependencies:**

   ```bash
   yarn install
   ```

2. **Install Python dependencies:**

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. **Start the server:**

   ```bash
   yarn start
   ```
## Architecture

- **Backend**: Node.js Express server with WebSocket support
- **Speech Processing**: Python subprocess using the Vosk library
- **Frontend**: HTML5 + JavaScript with AudioWorklet for microphone capture
- **Communication**: WebSocket for bidirectional real-time communication
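One common way for the Node server to read results from a Python subprocess is line-delimited JSON on stdout. The sketch below shows that framing; it is an assumption about the `server.js` / `speech_processor.py` protocol, not a description of the actual implementation:

```javascript
// Accumulate stdout data from the speech subprocess and invoke a
// callback with one parsed JSON object per complete line.
// Line-delimited JSON framing is an assumed protocol.
function makeLineParser(onMessage) {
  let pending = '';
  return (data) => {
    pending += data.toString();
    const lines = pending.split('\n');
    pending = lines.pop(); // keep any incomplete trailing line
    for (const line of lines) {
      if (line.trim()) onMessage(JSON.parse(line));
    }
  };
}

// Hypothetical wiring in the server:
//   const proc = spawn('python3', ['speech_processor.py']);
//   proc.stdout.on('data', makeLineParser((msg) => ws.send(JSON.stringify(msg))));
```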
## Supported Audio Formats

- **Input**: WAV files (16kHz, 16-bit, mono preferred)
- **Browser**: Automatic conversion from microphone input
- **API**: Raw audio buffers or WAV format
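The browser-side "automatic conversion" boils down to turning the Float32 samples an AudioWorklet produces into 16-bit signed PCM. A minimal sketch of that step (resampling to 16 kHz is omitted, and the exact conversion in the bundled frontend may differ):

```javascript
// Convert Float32 samples in [-1, 1] (what Web Audio produces) into
// 16-bit signed little-endian PCM, the format the server expects.
function floatTo16BitPcm(float32Samples) {
  const buf = Buffer.alloc(float32Samples.length * 2);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp out-of-range samples, then scale to the int16 range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    buf.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return buf;
}
```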
## Performance Notes

- **Model Size**: Small model (~39MB) for fast loading
- **Latency**: Near real-time; depends on audio chunk size
- **Accuracy**: Good for clear speech; may degrade with background noise
- **Resource Usage**: Lightweight, suitable for local deployment
## Troubleshooting

### Common Issues

- **Model not found**: Ensure the Vosk model is extracted to the `./vosk-model/` directory
- **Python errors**: Check that the virtual environment is activated and dependencies are installed
- **WebSocket connection fails**: Verify the server is running on port 3000
- **No audio**: Check browser microphone permissions
### Docker Issues

- **Build failures**: Ensure you have enough disk space for the image
- **Model mounting**: Verify `./vosk-model/` exists before running docker-compose
- **Permission errors**: Check file permissions on the `vosk-model` directory
## Development

- **Server logs**: `docker-compose logs -f` to see real-time logs
- **Rebuild**: `docker-compose up --build` after code changes
- **Stop**: `docker-compose down` to stop all services
## Model Information

- **Current**: Vosk Small English US (0.15)
- **Size**: ~39MB
- **Languages**: English (US)
- **Accuracy**: Optimized for speed over accuracy
- **Alternatives**: See the [Vosk models page](https://alphacephei.com/vosk/models) for other languages and sizes