# siliconpin_spider

A Go-based web crawler with per-domain SQLite storage, robots.txt compliance,
randomised polite delays, and Server-Sent Events (SSE) for real-time progress.

---

## Requirements

| Tool | Notes |
|------|-------|
| Go 1.21+ | |
| GCC | Required by `go-sqlite3` (CGO) |

```bash
# Ubuntu / Debian
apt install gcc

# macOS (Xcode CLI tools)
xcode-select --install
```

---

## Run

```bash
go mod tidy
go run main.go
# Server → http://localhost:8080
```

---

## On startup

- Creates **`siliconpin_spider.sqlite`** (domains registry)
- Serves `./static/` at `/`
- **Resumes** any crawls that were previously registered

---

## API

### `POST /api/add_domain`

Register a domain and immediately start crawling it.

```bash
curl -X POST http://localhost:8080/api/add_domain \
  -H "Content-Type: application/json" \
  -d '{"domain":"siliconpin.com","Crawl-delay":"20"}'
```

**Body fields**

| Field | Required | Default | Notes |
|-------|----------|---------|-------|
| `domain` | ✅ | — | bare domain, scheme/www stripped automatically |
| `Crawl-delay` | ❌ | `60` | seconds; actual delay is random in `[N, N*2]` |

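
The `domain` field is forgiving about input: a scheme and a leading `www.` are stripped before the domain is registered. A minimal sketch of that normalisation — the helper name and exact rules are assumptions, not taken from the source:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDomain is a hypothetical sketch of the server-side cleanup:
// it strips an http/https scheme, a leading "www.", and any path suffix.
func normalizeDomain(raw string) string {
	d := strings.TrimSpace(raw)
	d = strings.TrimPrefix(d, "https://")
	d = strings.TrimPrefix(d, "http://")
	d = strings.TrimPrefix(d, "www.")
	if i := strings.IndexByte(d, '/'); i >= 0 {
		d = d[:i]
	}
	return d
}

func main() {
	fmt.Println(normalizeDomain("https://www.siliconpin.com/about")) // siliconpin.com
}
```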
**Response `201`**

```json
{
  "message": "domain added, crawler started",
  "domain": "siliconpin.com",
  "interval": 20,
  "db_file": "siliconpin.com.sqlite",
  "sse": "/api/sse/siliconpin.com"
}
```

Creates **`siliconpin.com.sqlite`** with table:

```
urls(id, url UNIQUE, created_at, updated_at)
```

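
The `UNIQUE` constraint on `url` is what makes re-crawls idempotent: inserting an already-seen URL can simply be ignored rather than raising an error. A sketch of the likely statements — the column types and exact SQL are assumptions, not taken from the source:

```sql
-- assumed schema sketch; column types are guesses
CREATE TABLE IF NOT EXISTS urls (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    url        TEXT UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- a duplicate URL becomes a no-op instead of an error
INSERT OR IGNORE INTO urls(url) VALUES (?);
```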
---

### `GET /api/sse/{domain}`

Stream crawl events for any registered domain as **Server-Sent Events**.

```bash
curl -N http://localhost:8080/api/sse/siliconpin.com
curl -N http://localhost:8080/api/sse/cicdhosting.com
```

Each `data:` line is a JSON object:

```
data: {"event":"connected", "data":{"domain":"siliconpin.com"}}
data: {"event":"status", "data":{"msg":"fetching robots.txt"}}
data: {"event":"robots", "data":{"disallowed":["/admin/"],"robots_delay":10,"effective_delay":20}}
data: {"event":"waiting", "data":{"url":"https://siliconpin.com/about","delay_s":27,"queue":4}}
data: {"event":"fetching", "data":{"url":"https://siliconpin.com/about"}}
data: {"event":"saved", "data":{"url":"…","status":200,"content_type":"text/html"}}
data: {"event":"links_found","data":{"url":"…","found":12,"new":8,"queue_len":12}}
data: {"event":"skipped", "data":{"url":"…","reason":"robots.txt"}}
data: {"event":"error", "data":{"url":"…","msg":"…"}}
data: {"event":"done", "data":{"domain":"siliconpin.com","msg":"crawl complete"}}
: keepalive
```

Multiple browser tabs / curl processes can listen to the **same** domain stream simultaneously.

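
Beyond `curl -N`, a small Go client can consume the same stream. This is a sketch using only the standard library; the event-struct fields mirror the JSON lines shown above, and comment lines (`: keepalive`) are skipped:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// event mirrors the JSON payload of each SSE "data:" line.
type event struct {
	Event string          `json:"event"`
	Data  json.RawMessage `json:"data"`
}

// parseSSELine extracts the event from one "data: {...}" line.
// Comment lines such as ": keepalive" and blanks yield ok == false.
func parseSSELine(line string) (ev event, ok bool) {
	payload, found := strings.CutPrefix(line, "data: ")
	if !found {
		return event{}, false
	}
	if err := json.Unmarshal([]byte(payload), &ev); err != nil {
		return event{}, false
	}
	return ev, true
}

// listen streams events from a live endpoint (requires the crawler to be running).
func listen(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		if ev, ok := parseSSELine(sc.Text()); ok {
			fmt.Println(ev.Event)
		}
	}
	return sc.Err()
}

func main() {
	// Offline demo on a sample line from the stream above:
	ev, _ := parseSSELine(`data: {"event":"connected", "data":{"domain":"siliconpin.com"}}`)
	fmt.Println(ev.Event) // connected

	// Against a running server:
	// listen("http://localhost:8080/api/sse/siliconpin.com")
}
```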
---

## Crawl behaviour

1. Fetches `robots.txt`; respects `Disallow` paths and `Crawl-delay`
   - If `robots.txt` specifies a higher delay than you set, the higher value wins
2. BFS queue – same-host HTML links only
3. Random delay between requests: **`interval` → `interval × 2`** seconds
4. Skips already-visited URLs (checked against the domain's SQLite)
5. On restart, existing domains resume from where they left off (unvisited URLs are re-queued from the start URL; already saved URLs are skipped)
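
Steps 1 and 3 combine into a single per-request wait. A minimal sketch of that logic under the rules above — the function names are hypothetical, not taken from the source:

```go
package main

import (
	"fmt"
	"math/rand"
)

// effectiveDelay applies step 1: the higher of the user-supplied
// delay and the robots.txt Crawl-delay wins.
func effectiveDelay(userDelay, robotsDelay int) int {
	if robotsDelay > userDelay {
		return robotsDelay
	}
	return userDelay
}

// politeDelay applies step 3: a random wait in [interval, interval*2] seconds.
func politeDelay(interval int) int {
	return interval + rand.Intn(interval+1)
}

func main() {
	d := effectiveDelay(20, 10) // user set 20s, robots.txt asks for 10s
	fmt.Println(d)              // 20
	fmt.Println(politeDelay(d)) // somewhere in [20, 40]
}
```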