# siliconpin_spider
A Go-based web crawler with per-domain SQLite storage, robots.txt compliance,
randomised polite delays, and Server-Sent Events (SSE) for real-time progress.

---
## Requirements
| Tool | Notes |
|------|-------|
| Go 1.21+ | |
| GCC | Required by `go-sqlite3` (CGO) |
```bash
# Ubuntu / Debian
apt install gcc
# macOS (Xcode CLI tools)
xcode-select --install
```
---
## Run
```bash
go mod tidy
go run main.go
# Server → http://localhost:8080
```
---
## On startup
- Creates **`siliconpin_spider.sqlite`** (domains registry)
- Serves `./static/` at `/`
- **Resumes** any crawls that were previously registered
---
## API
### `POST /api/add_domain`
Register a domain and immediately start crawling it.
```bash
curl -X POST http://localhost:8080/api/add_domain \
-H "Content-Type: application/json" \
-d '{"domain":"siliconpin.com","Crawl-delay":"20"}'
```
**Body fields**

| Field | Required | Default | Notes |
|-------|----------|---------|-------|
| `domain` | ✅ | — | bare domain, scheme/www stripped automatically |
| `Crawl-delay` | ❌ | `60` | seconds; actual delay is random in `[N, N*2]` |
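
The "scheme/www stripped automatically" behaviour might look like the sketch below; the exact rules live in `main.go`, and `normalizeDomain` is a hypothetical name:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDomain sketches the normalization applied to the "domain"
// field: strip the scheme, a leading "www.", and any path.
// Lowercasing is an assumption, not documented behaviour.
func normalizeDomain(raw string) string {
	d := raw
	d = strings.TrimPrefix(d, "https://")
	d = strings.TrimPrefix(d, "http://")
	d = strings.TrimPrefix(d, "www.")
	if i := strings.IndexByte(d, '/'); i >= 0 {
		d = d[:i] // drop path and trailing slash
	}
	return strings.ToLower(d)
}

func main() {
	fmt.Println(normalizeDomain("https://www.siliconpin.com/about")) // siliconpin.com
}
```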
**Response `201`**
```json
{
"message": "domain added, crawler started",
"domain": "siliconpin.com",
"interval": 20,
"db_file": "siliconpin.com.sqlite",
"sse": "/api/sse/siliconpin.com"
}
```
Creates **`siliconpin.com.sqlite`** with table:
```
urls(id, url UNIQUE, created_at, updated_at)
```
---
### `GET /api/sse/{domain}`
Stream crawl events for any registered domain as **Server-Sent Events**.
```bash
curl -N http://localhost:8080/api/sse/siliconpin.com
curl -N http://localhost:8080/api/sse/cicdhosting.com
```
Each `data:` line is a JSON object:
```
data: {"event":"connected", "data":{"domain":"siliconpin.com"}}
data: {"event":"status", "data":{"msg":"fetching robots.txt"}}
data: {"event":"robots", "data":{"disallowed":["/admin/"],"robots_delay":10,"effective_delay":20}}
data: {"event":"waiting", "data":{"url":"https://siliconpin.com/about","delay_s":27,"queue":4}}
data: {"event":"fetching", "data":{"url":"https://siliconpin.com/about"}}
data: {"event":"saved", "data":{"url":"…","status":200,"content_type":"text/html"}}
data: {"event":"links_found","data":{"url":"…","found":12,"new":8,"queue_len":12}}
data: {"event":"skipped", "data":{"url":"…","reason":"robots.txt"}}
data: {"event":"error", "data":{"url":"…","msg":"…"}}
data: {"event":"done", "data":{"domain":"siliconpin.com","msg":"crawl complete"}}
: keepalive
```
Multiple browser tabs / curl processes can listen to the **same** domain stream simultaneously.
---
## Crawl behaviour
1. Fetches `robots.txt`; respects `Disallow` paths and `Crawl-delay`
   - If `robots.txt` specifies a higher delay than you set, the higher value wins
2. Crawls breadth-first; only same-host HTML links are queued
3. Random delay between requests: **`[interval, interval × 2]`** seconds
4. Skips already-visited URLs (checked against the domain's SQLite)
5. On restart, existing domains resume from where they left off (unvisited URLs are re-queued from the start URL; already saved URLs are skipped)