# XML Sitemap Generator API

A high-performance Go-based API for generating XML sitemaps with real-time progress tracking via Server-Sent Events (SSE).

## Features

- ✅ **Concurrent Web Crawling** - Fast sitemap generation using goroutines
- ✅ **Real-time Progress** - SSE streaming for live updates
- ✅ **Multi-user Support** - Handle multiple simultaneous crawls
- ✅ **Client Metadata Tracking** - IP, browser, OS, session data stored in SQLite
- ✅ **Clean REST API** - Simple endpoints for generate, stream, and download
- ✅ **Professional UI** - Clean web interface included
## Architecture

```
sitemap-api/
├── main.go              # Entry point & HTTP server
├── handlers/
│   └── handler.go       # HTTP handlers & SSE streaming
├── crawler/
│   └── crawler.go       # Concurrent web crawler
├── database/
│   └── db.go            # SQLite operations
├── models/
│   └── site.go          # Data structures
└── static/
    └── index.html       # Frontend UI
```
## API Endpoints

### `POST /generate-sitemap-xml`

Starts sitemap generation (the backend generates the UUID).

**Request:**
```json
{
  "url": "https://example.com",
  "max_depth": 3
}
```

**Response:**
```json
{
  "uuid": "550e8400-e29b-41d4-a716-446655440000",
  "site_id": 123,
  "status": "processing",
  "stream_url": "/stream/550e8400-...",
  "message": "Sitemap generation started"
}
```

### `GET /stream/{uuid}`

Server-Sent Events stream for real-time progress.

**Events:** `connected`, `started`, `progress`, `complete`, `error`

### `GET /download/{uuid}`

Downloads the generated sitemap XML.

### `GET /sites`

Lists all generated sitemaps.

### `GET /sites/{id}`

Gets details for a specific site.

### `DELETE /sites/{id}`

Deletes a sitemap.

### `GET /health`

Health check endpoint.
## Installation

### Prerequisites

- Go 1.21+
- SQLite3

### Setup

```bash
# Clone/navigate to directory
cd sitemap-api

# Install dependencies
go mod download

# Build
go build -o sitemap-api

# Run
./sitemap-api
```

Server starts on **http://localhost:8080**

### Or run directly:

```bash
go run main.go
```
## Usage

1. Open http://localhost:8080 in your browser
2. Enter a website URL
3. Set crawl depth (1-5)
4. Click "Generate Sitemap"
5. Watch real-time progress
6. Download XML when complete

## Database Schema

SQLite database (`sitemap.db`) stores:

- **sites** - Crawl sessions with client metadata
- **pages** - Discovered URLs with priority/frequency
- **sessions** - User session tracking
## Environment Variables

- `PORT` - Server port (default: 8080)

Example:

```bash
PORT=3000 ./sitemap-api
```
## How It Works

1. **Frontend** sends a POST to `/generate-sitemap-xml`
2. **Backend** generates a UUID, saves metadata, returns the UUID
3. **Frontend** connects to `/stream/{uuid}` for SSE updates
4. **Crawler** runs in a goroutine, sending events via a channel
5. **Handler** streams events to the frontend in real-time
6. **On completion**, the sitemap is available at `/download/{uuid}`
## Multi-User Concurrency

The `StreamManager` handles concurrent users:

- Each UUID maps to a Go channel
- A mutex-guarded map provides thread safety
- Channels are cleaned up automatically after crawl completion
- No fixed limit on simultaneous crawls
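A minimal sketch of such a manager (the type and method names are assumptions; the project's actual `StreamManager` API may differ):

```go
package main

import (
	"fmt"
	"sync"
)

// StreamManager maps crawl UUIDs to event channels, guarded by a mutex
// so concurrent handlers and crawlers can register and clean up safely.
type StreamManager struct {
	mu      sync.Mutex
	streams map[string]chan string
}

func NewStreamManager() *StreamManager {
	return &StreamManager{streams: make(map[string]chan string)}
}

// Register creates the event channel for a crawl UUID.
func (m *StreamManager) Register(uuid string) chan string {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan string, 16) // buffered so the crawler rarely blocks
	m.streams[uuid] = ch
	return ch
}

// Unregister closes and removes the channel once the crawl finishes.
func (m *StreamManager) Unregister(uuid string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.streams[uuid]; ok {
		close(ch)
		delete(m.streams, uuid)
	}
}

func main() {
	m := NewStreamManager()
	ch := m.Register("abc")
	ch <- "started"
	m.Unregister("abc")
	fmt.Println(<-ch) // drains the buffered event; prints: started
}
```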
## Client Metadata Captured

- IP address (with `X-Forwarded-For` support)
- User-Agent
- Browser name & version
- Operating system
- Device type (Desktop/Mobile/Tablet)
- Session ID (cookie-based)
- All cookies (JSON)
- Referrer
## Performance

- Concurrent crawling with goroutines
- Configurable concurrency limit (default: 5 parallel requests)
- Depth-limited to prevent infinite crawls
- Same-domain restriction
- Duplicate URL prevention
- 10-second HTTP timeout per request
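The duplicate-prevention and concurrency-limit points above combine a mutex-guarded visited set with a buffered-channel semaphore. A self-contained sketch with the HTTP fetch stubbed out:

```go
package main

import (
	"fmt"
	"sync"
)

// crawlState tracks which URLs have already been scheduled.
type crawlState struct {
	mu      sync.Mutex
	visited map[string]bool
}

// firstVisit reports whether url is new, marking it visited if so.
func (c *crawlState) firstVisit(url string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.visited[url] {
		return false
	}
	c.visited[url] = true
	return true
}

func main() {
	state := &crawlState{visited: make(map[string]bool)}
	sem := make(chan struct{}, 5) // at most 5 parallel "requests"
	var wg sync.WaitGroup
	urls := []string{"/", "/a", "/a", "/b"} // "/a" appears twice
	for _, u := range urls {
		if !state.firstVisit(u) {
			continue // duplicate URL prevention
		}
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			_ = u                    // a real crawler would fetch u here
		}(u)
	}
	wg.Wait()
	fmt.Println(len(state.visited)) // prints: 3
}
```

In the real crawler the fetch would use an `http.Client` with a 10-second `Timeout`, per the list above.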
## Customization

### Adjust Concurrency

Edit `crawler/crawler.go`:

```go
semaphore := make(chan struct{}, 10) // raise the limit to 10 concurrent requests
```

### Change Priority Calculation

Modify `calculatePriority()` in `crawler/crawler.go`.

### Add Custom Metadata

Extend the `models.Site` struct and the database schema.
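The actual `calculatePriority()` implementation is not shown in this README; one plausible depth-based scheme, purely illustrative:

```go
package main

import "fmt"

// calculatePriority is a hypothetical depth-based scheme: the root page
// gets 1.0 and each level deeper loses 0.2, floored at 0.1. The real
// function in crawler/crawler.go may use different weights.
func calculatePriority(depth int) float64 {
	p := 1.0 - 0.2*float64(depth)
	if p < 0.1 {
		p = 0.1
	}
	return p
}

func main() {
	for d := 0; d <= 5; d++ {
		fmt.Printf("depth %d -> %.1f\n", d, calculatePriority(d))
	}
}
```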
## Production Deployment

### Recommendations:

1. Use a reverse proxy (nginx/Caddy)
2. Enable HTTPS
3. Add rate limiting
4. Configure CORS properly
5. Replace SQLite with PostgreSQL for production
6. Add authentication
7. Implement cleanup jobs for old sitemaps

### Example nginx config:

```nginx
location / {
    proxy_pass http://localhost:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;
    proxy_cache_bypass $http_upgrade;

    # SSE support
    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 1h;  # keep long-lived event streams open (default is 60s)
}
```
## License

MIT

## Support

For issues or questions, please open a GitHub issue.