# siliconpin_spider

A Go-based web crawler with per-domain SQLite storage, robots.txt compliance,
randomised polite delays, and Server-Sent Events (SSE) for real-time progress.

---

## Requirements

| Tool | Notes |
|------|-------|
| Go 1.21+ | |
| GCC | Required by `go-sqlite3` (CGO) |

```bash
# Ubuntu / Debian
apt install gcc

# macOS (Xcode CLI tools)
xcode-select --install
```

---

## Run

```bash
go mod tidy
go run main.go
# Server → http://localhost:8080
```

---

## On startup

- Creates **`siliconpin_spider.sqlite`** (domains registry)
- Serves `./static/` at `/`
- **Resumes** any crawls that were previously registered

---

## API

### `POST /api/add_domain`

Register a domain and immediately start crawling it.

```bash
curl -X POST http://localhost:8080/api/add_domain \
  -H "Content-Type: application/json" \
  -d '{"domain":"siliconpin.com","Crawl-delay":"20"}'
```

**Body fields**

| Field | Required | Default | Notes |
|-------|----------|---------|-------|
| `domain` | ✅ | none | bare domain; a scheme or `www.` prefix is stripped automatically |
| `Crawl-delay` | ❌ | `60` | seconds; the actual delay is random in `[N, N*2]` |

**Response `201`**

```json
{
  "message": "domain added, crawler started",
  "domain": "siliconpin.com",
  "interval": 20,
  "db_file": "siliconpin.com.sqlite",
  "sse": "/api/sse/siliconpin.com"
}
```
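
For callers that prefer Go over `curl`, the request and response above could be wrapped roughly as follows. This is a minimal sketch, not code from the project: the names `addDomainRequest`, `addDomainResponse`, `encodeAddDomain`, `decodeAddDomain` and `addDomain` are hypothetical. The field names and the `201` status come from the examples above, and `Crawl-delay` is sent as a string only because the curl example does so.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// addDomainRequest mirrors the documented request body.
// Crawl-delay is a string here because the curl example sends "20".
type addDomainRequest struct {
	Domain     string `json:"domain"`
	CrawlDelay string `json:"Crawl-delay,omitempty"`
}

// addDomainResponse mirrors the documented 201 response.
type addDomainResponse struct {
	Message  string `json:"message"`
	Domain   string `json:"domain"`
	Interval int    `json:"interval"`
	DBFile   string `json:"db_file"`
	SSE      string `json:"sse"`
}

// encodeAddDomain builds the JSON request body (hypothetical helper).
func encodeAddDomain(domain, crawlDelay string) ([]byte, error) {
	return json.Marshal(addDomainRequest{Domain: domain, CrawlDelay: crawlDelay})
}

// decodeAddDomain parses a response body (hypothetical helper).
func decodeAddDomain(data []byte) (*addDomainResponse, error) {
	var out addDomainResponse
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	return &out, nil
}

// addDomain posts to /api/add_domain and expects the documented 201 status.
func addDomain(baseURL, domain, crawlDelay string) (*addDomainResponse, error) {
	body, err := encodeAddDomain(domain, crawlDelay)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(baseURL+"/api/add_domain", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var out addDomainResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

func main() {
	res, err := addDomain("http://localhost:8080", "siliconpin.com", "20")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("crawling %s, interval %ds, events at %s\n", res.Domain, res.Interval, res.SSE)
}
```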
The request also creates **`siliconpin.com.sqlite`** with the table:

```
urls(id, url UNIQUE, created_at, updated_at)
```

---

### `GET /api/sse/{domain}`

Stream crawl events for any registered domain as **Server-Sent Events**.

```bash
curl -N http://localhost:8080/api/sse/siliconpin.com
curl -N http://localhost:8080/api/sse/cicdhosting.com
```

Each `data:` line carries a JSON object; `: keepalive` lines are SSE comments that clients can ignore:

```
data: {"event":"connected", "data":{"domain":"siliconpin.com"}}
data: {"event":"status", "data":{"msg":"fetching robots.txt"}}
data: {"event":"robots", "data":{"disallowed":["/admin/"],"robots_delay":10,"effective_delay":20}}
data: {"event":"waiting", "data":{"url":"https://siliconpin.com/about","delay_s":27,"queue":4}}
data: {"event":"fetching", "data":{"url":"https://siliconpin.com/about"}}
data: {"event":"saved", "data":{"url":"…","status":200,"content_type":"text/html"}}
data: {"event":"links_found","data":{"url":"…","found":12,"new":8,"queue_len":12}}
data: {"event":"skipped", "data":{"url":"…","reason":"robots.txt"}}
data: {"event":"error", "data":{"url":"…","msg":"…"}}
data: {"event":"done", "data":{"domain":"siliconpin.com","msg":"crawl complete"}}
: keepalive
```
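
A Go client could consume this stream along the lines of the sketch below. It assumes only the `{"event": ..., "data": ...}` envelope shown above; `crawlEvent`, `parseSSELine` and `streamEvents` are hypothetical names, not part of the project. Stopping after a `done` event is a choice made for this sketch, since the README does not say whether the server closes the connection at that point.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

// crawlEvent matches the envelope of each data: line.
// Data stays raw because its shape differs per event type.
type crawlEvent struct {
	Event string          `json:"event"`
	Data  json.RawMessage `json:"data"`
}

// parseSSELine decodes one line of the stream. It returns ok=false for
// blank lines, comments such as ": keepalive", and any line that is not
// a "data:" line carrying valid JSON.
func parseSSELine(line string) (ev crawlEvent, ok bool) {
	payload, found := strings.CutPrefix(line, "data:")
	if !found {
		return crawlEvent{}, false
	}
	if err := json.Unmarshal([]byte(strings.TrimSpace(payload)), &ev); err != nil {
		return crawlEvent{}, false
	}
	return ev, true
}

// streamEvents reads r line by line and calls handle for every event
// until the stream ends or a "done" event arrives.
func streamEvents(r io.Reader, handle func(crawlEvent)) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		ev, ok := parseSSELine(sc.Text())
		if !ok {
			continue
		}
		handle(ev)
		if ev.Event == "done" {
			return nil
		}
	}
	return sc.Err()
}

func main() {
	resp, err := http.Get("http://localhost:8080/api/sse/siliconpin.com")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Print each event name followed by its raw JSON payload.
	if err := streamEvents(resp.Body, func(ev crawlEvent) {
		fmt.Printf("%-12s %s\n", ev.Event, ev.Data)
	}); err != nil {
		log.Fatal(err)
	}
}
```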
Multiple browser tabs / curl processes can listen to the **same** domain stream simultaneously.

---

## Crawl behaviour

1. Fetches `robots.txt`; respects `Disallow` paths and `Crawl-delay`
   - If `robots.txt` specifies a higher delay than you set, the higher value wins
2. Breadth-first (BFS) queue; only same-host HTML links are followed
3. Random delay between requests: **`interval` → `interval × 2`** seconds
4. Skips already-visited URLs (checked against the domain's SQLite database)
5. On restart, existing domains resume from where they left off (unvisited URLs are re-queued from the start URL; already saved URLs are skipped)
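
The delay and queue rules in the list above can be sketched in Go as follows. This is an illustration under assumptions, not the project's implementation: `effectiveDelay`, `politeDelay`, `sameHost` and `crawlOrder` are hypothetical names. An in-memory map stands in for real fetching and for the SQLite visited check. Treating `www.` as the same host is inferred from the `domain` field notes.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/url"
	"strings"
	"time"
)

// effectiveDelay returns the larger of the user-supplied interval and the
// Crawl-delay from robots.txt, both in seconds (step 1).
func effectiveDelay(userDelay, robotsDelay int) int {
	if robotsDelay > userDelay {
		return robotsDelay
	}
	return userDelay
}

// politeDelay picks a random whole number of seconds in
// [interval, interval*2] (step 3).
func politeDelay(interval int) time.Duration {
	if interval <= 0 {
		return 0
	}
	return time.Duration(interval+rand.Intn(interval+1)) * time.Second
}

// sameHost reports whether link points at the crawled domain.
// Ignoring a leading "www." is an assumption of this sketch.
func sameHost(link, domain string) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	host := strings.TrimPrefix(strings.ToLower(u.Hostname()), "www.")
	return host == strings.ToLower(domain)
}

// crawlOrder walks an in-memory link graph breadth-first from start and
// visits each same-host URL once (steps 2 and 4). The links map replaces
// real HTTP fetching; the visited map replaces the SQLite lookup.
func crawlOrder(start, domain string, links map[string][]string) []string {
	visited := map[string]bool{start: true}
	queue := []string{start}
	var order []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		order = append(order, cur)
		for _, next := range links[cur] {
			if visited[next] || !sameHost(next, domain) {
				continue
			}
			visited[next] = true
			queue = append(queue, next)
		}
	}
	return order
}

func main() {
	// 20 and 10 are the values from the "robots" event example.
	interval := effectiveDelay(20, 10)
	fmt.Println("effective interval (s):", interval)
	fmt.Println("next wait:", politeDelay(interval))

	links := map[string][]string{
		"https://siliconpin.com/":  {"https://siliconpin.com/about", "https://example.org/"},
		"https://siliconpin.com/about": {"https://siliconpin.com/"},
	}
	fmt.Println(crawlOrder("https://siliconpin.com/", "siliconpin.com", links))
}
```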