siliconpin_spider

A Go-based web crawler with per-domain SQLite storage, robots.txt compliance, randomised polite delays, and Server-Sent Events (SSE) for real-time progress.


Requirements

Tool       Notes
Go 1.21+
GCC        Required by go-sqlite3 (CGO)
# Ubuntu / Debian
apt install gcc

# macOS (Xcode CLI tools)
xcode-select --install
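
go-sqlite3 compiles C sources through cgo, so CGO must be enabled at build time (it normally is once a C compiler is installed). If the build fails with cgo errors:

CGO_ENABLED=1 go build ./...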

Run

go mod tidy
go run main.go
# Server → http://localhost:8080

On startup

  • Creates siliconpin_spider.sqlite (domains registry)
  • Serves ./static/ at /
  • Resumes any crawls that were previously registered
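
A minimal sketch of that startup sequence, assuming the usual database/sql + go-sqlite3 setup; identifiers such as crawl and the domains columns are illustrative, not copied from main.go:

package main

import (
	"database/sql"
	"log"
	"net/http"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// open (or create) the domains registry
	db, err := sql.Open("sqlite3", "siliconpin_spider.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS domains
		(domain TEXT UNIQUE, interval INTEGER)`); err != nil {
		log.Fatal(err)
	}

	// resume every previously registered crawl
	rows, err := db.Query(`SELECT domain, interval FROM domains`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var d string
		var n int
		if err := rows.Scan(&d, &n); err == nil {
			go crawl(d, n) // hypothetical crawler entry point
		}
	}

	// serve ./static/ at /
	http.Handle("/", http.FileServer(http.Dir("./static")))
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func crawl(domain string, interval int) { /* crawl loop elided */ }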

API

POST /api/add_domain

Register a domain and immediately start crawling it.

curl -X POST http://localhost:8080/api/add_domain \
  -H "Content-Type: application/json" \
  -d '{"domain":"siliconpin.com","Crawl-delay":"20"}'

Body fields

Field        Required  Default  Notes
domain       yes       (none)   bare domain, scheme/www stripped automatically
Crawl-delay  no        60       seconds; actual delay is random in [N, N*2]
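
The scheme/www stripping implied by the domain row might look like this (normalizeDomain is a hypothetical helper, not the project's actual code):

import "strings"

// normalizeDomain reduces input such as "https://www.siliconpin.com/x"
// to the bare host "siliconpin.com".
func normalizeDomain(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	s = strings.TrimPrefix(s, "https://")
	s = strings.TrimPrefix(s, "http://")
	s = strings.TrimPrefix(s, "www.")
	if i := strings.IndexByte(s, '/'); i >= 0 { // drop any path component
		s = s[:i]
	}
	return s
}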

Response 201

{
  "message":  "domain added, crawler started",
  "domain":   "siliconpin.com",
  "interval": 20,
  "db_file":  "siliconpin.com.sqlite",
  "sse":      "/api/sse/siliconpin.com"
}

Creates siliconpin.com.sqlite with table:

urls(id, url UNIQUE, created_at, updated_at)
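
The UNIQUE constraint on url is what makes the visited check cheap; a sketch of how an insert can double as that check (saveURL is a hypothetical helper):

import "database/sql"

// saveURL reports whether the URL was new. INSERT OR IGNORE is a no-op
// for rows that collide with the UNIQUE index on url.
func saveURL(db *sql.DB, url string) (bool, error) {
	res, err := db.Exec(`INSERT OR IGNORE INTO urls (url, created_at, updated_at)
		VALUES (?, datetime('now'), datetime('now'))`, url)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n > 0, err
}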

GET /api/sse/{domain}

Stream crawl events for any registered domain as Server-Sent Events.

curl -N http://localhost:8080/api/sse/siliconpin.com
curl -N http://localhost:8080/api/sse/cicdhosting.com

Each data: line is a JSON object:

data: {"event":"connected",  "data":{"domain":"siliconpin.com"}}
data: {"event":"status",     "data":{"msg":"fetching robots.txt"}}
data: {"event":"robots",     "data":{"disallowed":["/admin/"],"robots_delay":10,"effective_delay":20}}
data: {"event":"waiting",    "data":{"url":"https://siliconpin.com/about","delay_s":27,"queue":4}}
data: {"event":"fetching",   "data":{"url":"https://siliconpin.com/about"}}
data: {"event":"saved",      "data":{"url":"…","status":200,"content_type":"text/html"}}
data: {"event":"links_found","data":{"url":"…","found":12,"new":8,"queue_len":12}}
data: {"event":"skipped",    "data":{"url":"…","reason":"robots.txt"}}
data: {"event":"error",      "data":{"url":"…","msg":"…"}}
data: {"event":"done",       "data":{"domain":"siliconpin.com","msg":"crawl complete"}}
: keepalive

Multiple browser tabs / curl processes can listen to the same domain stream simultaneously.
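
Outside the browser (where EventSource handles this natively), a small Go client can consume the stream; a sketch that simply prints each data: payload:

package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:8080/api/sse/siliconpin.com")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "data: ") {
			fmt.Println(strings.TrimPrefix(line, "data: "))
		}
		// lines starting with ":" are keepalives and can be ignored
	}
}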


Crawl behaviour

  1. Fetches robots.txt; respects Disallow paths and Crawl-delay
    • If robots.txt specifies a higher delay than you set, the higher value wins
  2. BFS crawl queue; only same-host HTML links are followed
  3. Random delay between requests: between interval and interval × 2 seconds (see the sketch after this list)
  4. Skips already-visited URLs (checked against the domain's SQLite)
  5. On restart, existing domains resume from where they left off (unvisited URLs are re-queued from the start URL; already saved URLs are skipped)
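
Points 1 and 3 reduce to a few lines; a sketch assuming whole-second delays (both function names are illustrative):

import (
	"math/rand"
	"time"
)

// effectiveDelay: the larger of the user's delay and robots.txt's
// Crawl-delay wins (point 1).
func effectiveDelay(userDelay, robotsDelay int) int {
	if robotsDelay > userDelay {
		return robotsDelay
	}
	return userDelay
}

// randomWait: a random duration in [d, d*2] seconds (point 3).
func randomWait(d int) time.Duration {
	return time.Duration(d+rand.Intn(d+1)) * time.Second
}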