init
This commit is contained in:
147
README.md
Normal file
147
README.md
Normal file
@@ -0,0 +1,147 @@
|
||||
# Internet Domain Database
|
||||
|
||||
This project maintains two separate tables for internet domain data:
|
||||
- `domain_root`: Top-Level Domains (TLDs) from IANA
|
||||
- `domain_suffix`: Public Suffix List from Mozilla
|
||||
|
||||
## Database Connection
|
||||
|
||||
- **Host**: l2
|
||||
- **Port**: 3306
|
||||
- **User**: root
|
||||
- **Database**: sp_spider
|
||||
|
||||
## Table Structure
|
||||
|
||||
### domain_root
|
||||
Contains IANA TLD data with unique root domains and soft delete functionality.
|
||||
```sql
|
||||
CREATE TABLE domain_root (
|
||||
id INT AUTO_INCREMENT PRIMARY KEY,
|
||||
root VARCHAR(63) NOT NULL UNIQUE,
|
||||
removed BOOLEAN DEFAULT FALSE,
|
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
||||
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
|
||||
INDEX idx_root (root),
|
||||
INDEX idx_removed (removed)
|
||||
);
|
||||
```
|
||||
|
||||
### domain_suffix
|
||||
Contains Public Suffix List data with unique suffixes and soft delete functionality.
|
||||
```sql
|
||||
CREATE TABLE domain_suffix (
|
||||
id INT AUTO_INCREMENT PRIMARY KEY,
|
||||
suffix VARCHAR(255) NOT NULL UNIQUE,
|
||||
removed BOOLEAN DEFAULT FALSE,
|
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
||||
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
|
||||
INDEX idx_suffix (suffix),
|
||||
INDEX idx_removed (removed)
|
||||
);
|
||||
```
|
||||
|
||||
## Setup
|
||||
|
||||
1. Install dependencies:
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
2. Create the tables:
|
||||
```bash
|
||||
mariadb -h l2 -u root -p0000 --ssl=FALSE sp_spider < create_separate_tables.sql
|
||||
```
|
||||
|
||||
3. Add the removed column (for existing installations):
|
||||
```bash
|
||||
mariadb -h l2 -u root -p0000 --ssl=FALSE sp_spider < add_removed_column.sql
|
||||
```
|
||||
|
||||
4. Populate the tables:
|
||||
```bash
|
||||
python populate_separate_tables.py 0000
|
||||
```
|
||||
|
||||
5. Update tables with soft delete functionality:
|
||||
```bash
|
||||
python update.py 0000
|
||||
```
|
||||
|
||||
## Data Sources
|
||||
|
||||
- **TLD Data**: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
|
||||
- Contains official Top-Level Domains
|
||||
- Currently: 1,436 TLDs
|
||||
|
||||
- **Public Suffix List**: https://publicsuffix.org/list/public_suffix_list.dat
|
||||
- Contains public domain suffixes including TLDs, ccTLDs, and private domain suffixes
|
||||
- Currently: 10,067 suffixes
|
||||
|
||||
## Usage
|
||||
|
||||
Query active TLDs:
|
||||
```sql
|
||||
SELECT * FROM domain_root WHERE removed = FALSE AND root = 'com';
|
||||
```
|
||||
|
||||
Query removed TLDs:
|
||||
```sql
|
||||
SELECT * FROM domain_root WHERE removed = TRUE;
|
||||
```
|
||||
|
||||
Query active suffixes:
|
||||
```sql
|
||||
SELECT * FROM domain_suffix WHERE removed = FALSE AND suffix LIKE '%.com';
|
||||
```
|
||||
|
||||
Query removed suffixes:
|
||||
```sql
|
||||
SELECT * FROM domain_suffix WHERE removed = TRUE;
|
||||
```
|
||||
|
||||
Get statistics:
|
||||
```sql
|
||||
SELECT
|
||||
'domain_root' as table_name,
|
||||
COUNT(*) as total,
|
||||
SUM(CASE WHEN removed = FALSE THEN 1 ELSE 0 END) as active,
|
||||
SUM(CASE WHEN removed = TRUE THEN 1 ELSE 0 END) as removed
|
||||
FROM domain_root
|
||||
UNION ALL
|
||||
SELECT
|
||||
'domain_suffix' as table_name,
|
||||
COUNT(*) as total,
|
||||
SUM(CASE WHEN removed = FALSE THEN 1 ELSE 0 END) as active,
|
||||
SUM(CASE WHEN removed = TRUE THEN 1 ELSE 0 END) as removed
|
||||
FROM domain_suffix;
|
||||
```
|
||||
|
||||
## Project Files
|
||||
|
||||
- `create_separate_tables.sql` - Table creation script
|
||||
- `add_removed_column.sql` - Script to add removed column for soft delete functionality
|
||||
- `populate_separate_tables.py` - Initial data population script
|
||||
- `update.py` - Update script with soft delete functionality
|
||||
- `requirements.txt` - Python dependencies
|
||||
- `README.md` - This documentation
|
||||
|
||||
## Soft Delete Functionality
|
||||
|
||||
The `update.py` script provides soft delete functionality that:
|
||||
|
||||
1. **Marks entries as removed**: Sets `removed = TRUE` for entries no longer found in source data
|
||||
2. **Adds new entries**: Inserts new entries from source with `removed = FALSE`
|
||||
3. **Restores entries**: Sets `removed = FALSE` for previously removed entries that reappear in source
|
||||
4. **Provides statistics**: Shows counts of active, removed, new, and restored entries
|
||||
|
||||
### Update Process
|
||||
|
||||
The update script:
|
||||
- Fetches latest data from IANA TLD and Public Suffix List
|
||||
- Compares with current database entries
|
||||
- Performs batch updates for efficiency
|
||||
- Handles duplicate entries gracefully with `INSERT IGNORE`
|
||||
- Updates `updated_at` timestamp for all changes
|
||||
|
||||
Run the update script periodically to keep the database synchronized with source data.
|
||||
Reference in New Issue
Block a user