# XML Sitemap Generator API

A high-performance Go-based API for generating XML sitemaps with real-time progress tracking via Server-Sent Events (SSE).
## Features

- ✅ **Concurrent Web Crawling** - Fast sitemap generation using goroutines
- ✅ **Real-time Progress** - SSE streaming for live updates
- ✅ **Multi-user Support** - Handles multiple simultaneous crawls
- ✅ **Client Metadata Tracking** - IP, browser, OS, and session data stored in SQLite
- ✅ **Clean REST API** - Simple endpoints for generate, stream, and download
- ✅ **Professional UI** - Polished web interface included
## Architecture

```
sitemap-api/
├── main.go              # Entry point & HTTP server
├── handlers/
│   └── handler.go       # HTTP handlers & SSE streaming
├── crawler/
│   └── crawler.go       # Concurrent web crawler
├── database/
│   └── db.go            # SQLite operations
├── models/
│   └── site.go          # Data structures
└── static/
    └── index.html       # Frontend UI
```
## API Endpoints

### `POST /generate-sitemap-xml`

Start sitemap generation (the backend generates the UUID).

**Request:**
```json
{
  "url": "https://example.com",
  "max_depth": 3
}
```

**Response:**
```json
{
  "uuid": "550e8400-e29b-41d4-a716-446655440000",
  "site_id": 123,
  "status": "processing",
  "stream_url": "/stream/550e8400-...",
  "message": "Sitemap generation started"
}
```
### `GET /stream/{uuid}`

Server-Sent Events stream for real-time progress.

**Events:** `connected`, `started`, `progress`, `complete`, `error`
### `GET /download/{uuid}`

Download the generated sitemap XML.

### `GET /sites`

List all generated sitemaps.

### `GET /sites/{id}`

Get details for a specific site.

### `DELETE /sites/{id}`

Delete a sitemap.

### `GET /health`

Health check endpoint.
## Installation

### Prerequisites

- Go 1.21+
- SQLite3

### Setup

```bash
# Clone/navigate to the directory
cd sitemap-api

# Install dependencies
go mod download

# Build
go build -o sitemap-api

# Run
./sitemap-api
```

Server starts on **http://localhost:8080**

### Or run directly:

```bash
go run main.go
```
## Usage

1. Open http://localhost:8080 in your browser
2. Enter a website URL
3. Set the crawl depth (1-5)
4. Click "Generate Sitemap"
5. Watch real-time progress
6. Download the XML when complete
## Database Schema

The SQLite database (`sitemap.db`) stores:

- **sites** - Crawl sessions with client metadata
- **pages** - Discovered URLs with priority/frequency
- **sessions** - User session tracking
## Environment Variables

- `PORT` - Server port (default: 8080)

Example:
```bash
PORT=3000 ./sitemap-api
```
## How It Works

1. The **frontend** sends a POST request to `/generate-sitemap-xml`
2. The **backend** generates a UUID, saves the client metadata, and returns the UUID
3. The **frontend** connects to `/stream/{uuid}` for SSE updates
4. The **crawler** runs in a goroutine and sends events over a channel
5. The **handler** streams those events to the frontend in real time
6. On completion, the sitemap is available at `/download/{uuid}`
## Multi-User Concurrency

The `StreamManager` handles concurrent users:

- Each UUID maps to a Go channel
- A mutex-protected map keeps access thread-safe
- Streams are cleaned up automatically after crawl completion
- Simultaneous crawls are limited only by available system resources
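A mutex-protected map of channels along these lines might look like the following sketch (names and buffer size are illustrative; the actual implementation lives in `handlers/handler.go`):

```go
package main

import "sync"

// StreamManager maps each crawl UUID to its event channel.
type StreamManager struct {
	mu      sync.Mutex
	streams map[string]chan string
}

func NewStreamManager() *StreamManager {
	return &StreamManager{streams: make(map[string]chan string)}
}

// Register creates a buffered event channel for a crawl UUID.
func (m *StreamManager) Register(uuid string) chan string {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan string, 16)
	m.streams[uuid] = ch
	return ch
}

// Get returns the channel for a UUID, if one is registered.
func (m *StreamManager) Get(uuid string) (chan string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch, ok := m.streams[uuid]
	return ch, ok
}

// Remove closes and deletes a stream once the crawl completes.
func (m *StreamManager) Remove(uuid string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.streams[uuid]; ok {
		close(ch)
		delete(m.streams, uuid)
	}
}
```

The crawler goroutine sends into the channel returned by `Register`, and the SSE handler reads from it until `Remove` closes it.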
## Client Metadata Captured

- IP address (with `X-Forwarded-For` support)
- User-Agent
- Browser name & version
- Operating system
- Device type (Desktop/Mobile/Tablet)
- Session ID (cookie-based)
- All cookies (as JSON)
- Referrer
## Performance

- Concurrent crawling with goroutines
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited crawls to prevent infinite recursion
- Same-domain restriction
- Duplicate URL prevention
- 10-second HTTP timeout per request
## Customization

### Adjust Concurrency

Edit `crawler/crawler.go`:

```go
semaphore := make(chan struct{}, 10) // increase to 10 concurrent requests
```

### Change Priority Calculation

Modify `calculatePriority()` in `crawler/crawler.go`.
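A replacement could follow this depth-based sketch (the exact formula here is an assumption, not the shipped one):

```go
package main

// calculatePriority assigns a sitemap <priority> from crawl depth:
// the root page gets 1.0 and each level deeper halves the weight,
// floored at 0.1 (the sitemap protocol's valid range is 0.0-1.0).
func calculatePriority(depth int) float64 {
	p := 1.0
	for i := 0; i < depth; i++ {
		p /= 2
	}
	if p < 0.1 {
		p = 0.1
	}
	return p
}
```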
### Add Custom Metadata

Extend the `models.Site` struct and the database schema.
## Production Deployment

### Recommendations:

1. Use a reverse proxy (nginx/Caddy)
2. Enable HTTPS
3. Add rate limiting
4. Configure CORS properly
5. Use PostgreSQL in production (replacing SQLite)
6. Add authentication
7. Implement cleanup jobs for old sitemaps

### Example nginx config:

```nginx
location / {
    proxy_pass http://localhost:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;
    proxy_cache_bypass $http_upgrade;

    # SSE support
    proxy_buffering off;
    proxy_cache off;
}
```
## License

MIT

## Support

For issues or questions, please open a GitHub issue.