# siliconpin_spider

A Go-based web crawler with per-domain SQLite storage, robots.txt compliance, randomised polite delays, and Server-Sent Events (SSE) for real-time progress.

---

## Requirements

| Tool | Notes |
|------|-------|
| Go 1.21+ | |
| GCC | Required by `go-sqlite3` (CGO) |

```bash
# Ubuntu / Debian
apt install gcc

# macOS (Xcode CLI tools)
xcode-select --install
```

---

## Run

```bash
go mod tidy
go run main.go
# Server → http://localhost:8080
```

---

## On startup

- Creates **`siliconpin_spider.sqlite`** (domains registry)
- Serves `./static/` at `/`
- **Resumes** any crawls that were previously registered

---

## API

### `POST /api/add_domain`

Register a domain and immediately start crawling it.

```bash
curl -X POST http://localhost:8080/api/add_domain \
  -H "Content-Type: application/json" \
  -d '{"domain":"siliconpin.com","Crawl-delay":"20"}'
```

**Body fields**

| Field | Required | Default | Notes |
|-------|----------|---------|-------|
| `domain` | ✅ | — | bare domain; scheme/www stripped automatically |
| `Crawl-delay` | ❌ | `60` | seconds; actual delay is random in `[N, N*2]` |

**Response `201`**

```json
{
  "message": "domain added, crawler started",
  "domain": "siliconpin.com",
  "interval": 20,
  "db_file": "siliconpin.com.sqlite",
  "sse": "/api/sse/siliconpin.com"
}
```

Creates **`siliconpin.com.sqlite`** with table:

```
urls(id, url UNIQUE, created_at, updated_at)
```

---

### `GET /api/sse/{domain}`

Stream crawl events for any registered domain as **Server-Sent Events**.
```bash
curl -N http://localhost:8080/api/sse/siliconpin.com
curl -N http://localhost:8080/api/sse/cicdhosting.com
```

Each `data:` line is a JSON object:

```
data: {"event":"connected",  "data":{"domain":"siliconpin.com"}}
data: {"event":"status",     "data":{"msg":"fetching robots.txt"}}
data: {"event":"robots",     "data":{"disallowed":["/admin/"],"robots_delay":10,"effective_delay":20}}
data: {"event":"waiting",    "data":{"url":"https://siliconpin.com/about","delay_s":27,"queue":4}}
data: {"event":"fetching",   "data":{"url":"https://siliconpin.com/about"}}
data: {"event":"saved",      "data":{"url":"…","status":200,"content_type":"text/html"}}
data: {"event":"links_found","data":{"url":"…","found":12,"new":8,"queue_len":12}}
data: {"event":"skipped",    "data":{"url":"…","reason":"robots.txt"}}
data: {"event":"error",      "data":{"url":"…","msg":"…"}}
data: {"event":"done",       "data":{"domain":"siliconpin.com","msg":"crawl complete"}}

: keepalive
```

Multiple browser tabs / curl processes can listen to the **same** domain stream simultaneously.

---

## Crawl behaviour

1. Fetches `robots.txt`; respects `Disallow` paths and `Crawl-delay`
   - If `robots.txt` specifies a higher delay than you set, the higher value wins
2. BFS queue: same-host HTML links only
3. Random delay between requests: **`interval` → `interval × 2`** seconds
4. Skips already-visited URLs (checked against the domain's SQLite)
5. On restart, existing domains resume where they left off (unvisited URLs are re-queued from the start URL; already-saved URLs are skipped)