# siliconpin_spider

A Go-based web crawler with per-domain SQLite storage, robots.txt compliance, randomised polite delays, and Server-Sent Events (SSE) for real-time progress.
## Requirements
| Tool | Notes |
|---|---|
| Go 1.21+ | |
| GCC | Required by go-sqlite3 (CGO) |
```sh
# Ubuntu / Debian
apt install gcc

# macOS (Xcode CLI tools)
xcode-select --install
```
## Run

```sh
go mod tidy
go run main.go
# Server → http://localhost:8080
```
On startup it:

- Creates `siliconpin_spider.sqlite` (domains registry)
- Serves `./static/` at `/`
- Resumes any crawls that were previously registered
## API
### POST /api/add_domain

Register a domain and immediately start crawling it.
```sh
curl -X POST http://localhost:8080/api/add_domain \
  -H "Content-Type: application/json" \
  -d '{"domain":"siliconpin.com","Crawl-delay":"20"}'
```
Body fields:

| Field | Required | Default | Notes |
|---|---|---|---|
| `domain` | ✅ | — | bare domain; scheme/www stripped automatically |
| `Crawl-delay` | ❌ | 60 | seconds; actual delay is random in [N, N*2] |
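The "scheme/www stripped automatically" rule can be sketched as a small helper. This is a hypothetical illustration (`normalizeDomain` is not from the project source), assuming the API accepts a URL-like string and reduces it to a bare host:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDomain is a hypothetical sketch of the stripping the API
// performs: drop the scheme, a leading "www.", and any trailing slash,
// leaving a bare domain suitable for a per-domain database filename.
func normalizeDomain(raw string) string {
	d := strings.TrimSpace(raw)
	d = strings.TrimPrefix(d, "https://")
	d = strings.TrimPrefix(d, "http://")
	d = strings.TrimPrefix(d, "www.")
	return strings.TrimSuffix(d, "/")
}

func main() {
	fmt.Println(normalizeDomain("https://www.siliconpin.com/"))
}
```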
Response `201`:

```json
{
  "message": "domain added, crawler started",
  "domain": "siliconpin.com",
  "interval": 20,
  "db_file": "siliconpin.com.sqlite",
  "sse": "/api/sse/siliconpin.com"
}
```
Creates `siliconpin.com.sqlite` with table:

```
urls(id, url UNIQUE, created_at, updated_at)
```
### GET /api/sse/{domain}

Stream crawl events for any registered domain as Server-Sent Events.
```sh
curl -N http://localhost:8080/api/sse/siliconpin.com
curl -N http://localhost:8080/api/sse/cicdhosting.com
```
Each `data:` line is a JSON object:

```
data: {"event":"connected", "data":{"domain":"siliconpin.com"}}
data: {"event":"status", "data":{"msg":"fetching robots.txt"}}
data: {"event":"robots", "data":{"disallowed":["/admin/"],"robots_delay":10,"effective_delay":20}}
data: {"event":"waiting", "data":{"url":"https://siliconpin.com/about","delay_s":27,"queue":4}}
data: {"event":"fetching", "data":{"url":"https://siliconpin.com/about"}}
data: {"event":"saved", "data":{"url":"…","status":200,"content_type":"text/html"}}
data: {"event":"links_found","data":{"url":"…","found":12,"new":8,"queue_len":12}}
data: {"event":"skipped", "data":{"url":"…","reason":"robots.txt"}}
data: {"event":"error", "data":{"url":"…","msg":"…"}}
data: {"event":"done", "data":{"domain":"siliconpin.com","msg":"crawl complete"}}
: keepalive
```
Multiple browser tabs / curl processes can listen to the same domain stream simultaneously.
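A client consuming this stream only needs to parse the `{"event":…,"data":…}` envelope shown above and skip comment lines such as `: keepalive`. A minimal client-side sketch (the `event` struct and `parseSSELine` helper are illustrative, not part of the project):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// event mirrors the JSON envelope each "data:" line carries,
// per the examples above.
type event struct {
	Event string         `json:"event"`
	Data  map[string]any `json:"data"`
}

// parseSSELine decodes one line of the stream. Comment lines
// (e.g. ": keepalive") and anything else without a "data:" prefix
// are reported as not-an-event.
func parseSSELine(line string) (*event, bool) {
	if !strings.HasPrefix(line, "data:") {
		return nil, false
	}
	payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
	var ev event
	if err := json.Unmarshal([]byte(payload), &ev); err != nil {
		return nil, false
	}
	return &ev, true
}

func main() {
	ev, ok := parseSSELine(`data: {"event":"saved","data":{"url":"https://siliconpin.com/","status":200}}`)
	fmt.Println(ok, ev.Event, ev.Data["status"])
}
```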
## Crawl behaviour
- Fetches `robots.txt`; respects `Disallow` paths and `Crawl-delay`
- If `robots.txt` specifies a higher delay than you set, the higher value wins
- BFS queue – same-host HTML links only
- Random delay between requests: `interval` → `interval × 2` seconds
- Skips already-visited URLs (checked against the domain's SQLite)
- On restart, existing domains resume from where they left off (unvisited URLs are re-queued from the start URL; already saved URLs are skipped)