Kar
2026-02-05 19:13:45 +05:30
commit 10b19d4ed6
13 changed files with 2828 additions and 0 deletions

# XML Sitemap Generator API
A high-performance Go-based API for generating XML sitemaps with real-time progress tracking via Server-Sent Events (SSE).
## Features
- **Concurrent Web Crawling** - Fast sitemap generation using goroutines
- **Real-time Progress** - SSE streaming for live updates
- **Multi-user Support** - Handle multiple simultaneous crawls
- **Client Metadata Tracking** - IP, browser, OS, session data stored in SQLite
- **Clean REST API** - Simple endpoints for generate, stream, and download
- **Professional UI** - Polished web interface included
## Architecture
```
sitemap-api/
├── main.go              # Entry point & HTTP server
├── handlers/
│   └── handler.go       # HTTP handlers & SSE streaming
├── crawler/
│   └── crawler.go       # Concurrent web crawler
├── database/
│   └── db.go            # SQLite operations
├── models/
│   └── site.go          # Data structures
└── static/
    └── index.html       # Frontend UI
```
## API Endpoints
### `POST /generate-sitemap-xml`
Starts sitemap generation; the backend generates and returns the UUID.
**Request:**
```json
{
  "url": "https://example.com",
  "max_depth": 3
}
```
**Response:**
```json
{
  "uuid": "550e8400-e29b-41d4-a716-446655440000",
  "site_id": 123,
  "status": "processing",
  "stream_url": "/stream/550e8400-...",
  "message": "Sitemap generation started"
}
```
### `GET /stream/{uuid}`
Server-Sent Events stream for real-time progress
**Events:** `connected`, `started`, `progress`, `complete`, `error`
### `GET /download/{uuid}`
Download generated sitemap XML
### `GET /sites`
List all generated sitemaps
### `GET /sites/{id}`
Get specific site details
### `DELETE /sites/{id}`
Delete a sitemap
### `GET /health`
Health check endpoint
## Installation
### Prerequisites
- Go 1.21+
- SQLite3
### Setup
```bash
# Clone/navigate to directory
cd sitemap-api
# Install dependencies
go mod download
# Build
go build -o sitemap-api
# Run
./sitemap-api
```
Server starts on **http://localhost:8080**
### Or run directly:
```bash
go run main.go
```
## Usage
1. Open http://localhost:8080 in your browser
2. Enter a website URL
3. Set crawl depth (1-5)
4. Click "Generate Sitemap"
5. Watch real-time progress
6. Download XML when complete
## Database Schema
SQLite database (`sitemap.db`) stores:
- **sites** - Crawl sessions with client metadata
- **pages** - Discovered URLs with priority/frequency
- **sessions** - User session tracking
## Environment Variables
- `PORT` - Server port (default: 8080)
Example:
```bash
PORT=3000 ./sitemap-api
```
## How It Works
1. **Frontend** sends POST to `/generate-sitemap-xml`
2. **Backend** generates UUID, saves metadata, returns UUID
3. **Frontend** connects to `/stream/{uuid}` for SSE updates
4. **Crawler** runs in goroutine, sends events via channel
5. **Handler** streams events to frontend in real-time
6. **On completion**, sitemap available at `/download/{uuid}`
## Multi-User Concurrency
The `StreamManager` handles concurrent users:
- Each UUID maps to a Go channel
- Concurrent map with mutex for thread safety
- Automatic cleanup after crawl completion
- No hard cap on simultaneous crawls (bounded only by server memory and goroutine overhead)
## Client Metadata Captured
- IP Address (with X-Forwarded-For support)
- User-Agent
- Browser name & version
- Operating System
- Device Type (Desktop/Mobile/Tablet)
- Session ID (cookie-based)
- All cookies (JSON)
- Referrer
## Performance
- Concurrent crawling with goroutines
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited to prevent infinite crawls
- Same-domain restriction
- Duplicate URL prevention
- 10-second HTTP timeout per request
## Customization
### Adjust Concurrency
Edit `crawler/crawler.go`:
```go
semaphore := make(chan struct{}, 10) // raise the limit from 5 to 10 concurrent requests
```
### Change Priority Calculation
Modify `calculatePriority()` in `crawler/crawler.go`
### Add Custom Metadata
Extend `models.Site` struct and database schema
## Production Deployment
### Recommendations:
1. Use reverse proxy (nginx/caddy)
2. Enable HTTPS
3. Add rate limiting
4. Configure CORS properly
5. Use PostgreSQL for production (replace SQLite)
6. Add authentication
7. Implement cleanup jobs for old sitemaps
### Example nginx config:
```nginx
location / {
    proxy_pass http://localhost:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;
    proxy_cache_bypass $http_upgrade;

    # SSE support
    proxy_buffering off;
    proxy_cache off;
}
```
## License
MIT
## Support
For issues or questions, please open a GitHub issue.