# XML Sitemap Generator API

A high-performance Go-based API for generating XML sitemaps with real-time progress tracking via Server-Sent Events (SSE).
## Features

- ✅ **Concurrent Web Crawling** - Fast sitemap generation using goroutines
- ✅ **Real-time Progress** - SSE streaming for live updates
- ✅ **Multi-user Support** - Handles multiple simultaneous crawls
- ✅ **Client Metadata Tracking** - IP, browser, OS, and session data stored in SQLite
- ✅ **Clean REST API** - Simple endpoints for generate, stream, and download
- ✅ **Professional UI** - Polished web interface included
## Architecture

```
sitemap-api/
├── main.go              # Entry point & HTTP server
├── handlers/
│   └── handler.go       # HTTP handlers & SSE streaming
├── crawler/
│   └── crawler.go       # Concurrent web crawler
├── database/
│   └── db.go            # SQLite operations
├── models/
│   └── site.go          # Data structures
└── static/
    └── index.html       # Frontend UI
```
## API Endpoints

### `POST /generate-sitemap-xml`

Start sitemap generation (the backend generates the UUID).

**Request:**
```json
{
  "url": "https://example.com",
  "max_depth": 3
}
```

**Response:**
```json
{
  "uuid": "550e8400-e29b-41d4-a716-446655440000",
  "site_id": 123,
  "status": "processing",
  "stream_url": "/stream/550e8400-...",
  "message": "Sitemap generation started"
}
```
### `GET /stream/{uuid}`

Server-Sent Events stream for real-time progress.

**Events:** `connected`, `started`, `progress`, `complete`, `error`
### `GET /download/{uuid}`

Download the generated sitemap XML.

### `GET /sites`

List all generated sitemaps.

### `GET /sites/{id}`

Get details for a specific site.

### `DELETE /sites/{id}`

Delete a sitemap.

### `GET /health`

Health check endpoint.
## Installation

### Prerequisites

- Go 1.21+
- SQLite3

### Setup

```bash
# Clone/navigate to the directory
cd sitemap-api

# Install dependencies
go mod download

# Build
go build -o sitemap-api

# Run
./sitemap-api
```

Server starts on **http://localhost:8080**

### Or run directly:

```bash
go run main.go
```
## Usage

1. Open http://localhost:8080 in your browser
2. Enter a website URL
3. Set the crawl depth (1-5)
4. Click "Generate Sitemap"
5. Watch real-time progress
6. Download the XML when complete
## Database Schema

The SQLite database (`sitemap.db`) stores:

- **sites** - Crawl sessions with client metadata
- **pages** - Discovered URLs with priority/frequency
- **sessions** - User session tracking
## Environment Variables

- `PORT` - Server port (default: 8080)

Example:
```bash
PORT=3000 ./sitemap-api
```
## How It Works

1. The **frontend** sends a POST request to `/generate-sitemap-xml`
2. The **backend** generates a UUID, saves the client metadata, and returns the UUID
3. The **frontend** connects to `/stream/{uuid}` for SSE updates
4. The **crawler** runs in a goroutine and sends events over a channel
5. The **handler** streams those events to the frontend in real time
6. On completion, the sitemap is available at `/download/{uuid}`
## Multi-User Concurrency

The `StreamManager` handles concurrent users:

- Each UUID maps to a Go channel
- A mutex-protected map keeps access thread-safe
- Streams are cleaned up automatically after crawl completion
- Simultaneous crawls are limited only by available system resources
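A mutex-protected map of channels along these lines might look like the following sketch (names and buffer size are illustrative; the actual implementation lives in `handlers/handler.go`):

```go
package main

import "sync"

// StreamManager maps each crawl UUID to its event channel.
type StreamManager struct {
	mu      sync.Mutex
	streams map[string]chan string
}

func NewStreamManager() *StreamManager {
	return &StreamManager{streams: make(map[string]chan string)}
}

// Register creates a buffered event channel for a crawl UUID.
func (m *StreamManager) Register(uuid string) chan string {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan string, 16)
	m.streams[uuid] = ch
	return ch
}

// Get returns the channel for a UUID, if one is registered.
func (m *StreamManager) Get(uuid string) (chan string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch, ok := m.streams[uuid]
	return ch, ok
}

// Remove closes and deletes a stream once the crawl completes.
func (m *StreamManager) Remove(uuid string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.streams[uuid]; ok {
		close(ch)
		delete(m.streams, uuid)
	}
}
```

The crawler goroutine sends into the channel returned by `Register`, and the SSE handler reads from it until `Remove` closes it.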
## Client Metadata Captured

- IP address (with `X-Forwarded-For` support)
- User-Agent
- Browser name & version
- Operating system
- Device type (Desktop/Mobile/Tablet)
- Session ID (cookie-based)
- All cookies (as JSON)
- Referrer
## Performance

- Concurrent crawling with goroutines
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited crawls to prevent infinite recursion
- Same-domain restriction
- Duplicate URL prevention
- 10-second HTTP timeout per request
## Customization

### Adjust Concurrency

Edit `crawler/crawler.go`:

```go
semaphore := make(chan struct{}, 10) // increase to 10 concurrent requests
```

### Change Priority Calculation

Modify `calculatePriority()` in `crawler/crawler.go`.
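A replacement could follow this depth-based sketch (the exact formula here is an assumption, not the shipped one):

```go
package main

// calculatePriority assigns a sitemap <priority> from crawl depth:
// the root page gets 1.0 and each level deeper halves the weight,
// floored at 0.1 (the sitemap protocol's valid range is 0.0-1.0).
func calculatePriority(depth int) float64 {
	p := 1.0
	for i := 0; i < depth; i++ {
		p /= 2
	}
	if p < 0.1 {
		p = 0.1
	}
	return p
}
```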
### Add Custom Metadata

Extend the `models.Site` struct and the database schema.
## Production Deployment

### Recommendations:

1. Use a reverse proxy (nginx/Caddy)
2. Enable HTTPS
3. Add rate limiting
4. Configure CORS properly
5. Use PostgreSQL in production (replacing SQLite)
6. Add authentication
7. Implement cleanup jobs for old sitemaps

### Example nginx config:

```nginx
location / {
    proxy_pass http://localhost:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;
    proxy_cache_bypass $http_upgrade;

    # SSE support
    proxy_buffering off;
    proxy_cache off;
}
```
## License

MIT

## Support

For issues or questions, please open a GitHub issue.