Scrapezy Bot Documentation

Overview

Scrapezy is a legitimate web scraping service that helps businesses extract structured data from websites for analytics, monitoring, research, and data aggregation purposes. Our AI-powered platform enables users to extract data according to custom schemas without writing code.

Service Purpose

Scrapezy provides:

Structured Data Extraction: AI-powered extraction of data from web pages according to user-defined schemas
Business Intelligence: Price monitoring, competitor analysis, and market research data collection
Data Aggregation: Automated collection and organization of publicly available web data
Research & Analytics: Academic and commercial research data gathering

User-Agent Identification

The Scrapezy bot identifies itself using the following user-agent string:

ScrapezyBot/1.0 (+https://scrapezy.com/docs/reference/bot)
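
As a rough illustration, a site operator could pick this token out of server logs or incoming request headers as shown below. This is a minimal sketch, not part of Scrapezy: the function name `scrapezy_version` and the matching rule (a `ScrapezyBot/<version>` token anywhere in the header) are assumptions. A User-Agent header can be spoofed, so this identifies traffic but does not verify it.

```python
import re
from typing import Optional

# Product token taken from the documented user-agent string.
# Assumption: any "ScrapezyBot/<version>" token in the header counts as a match.
SCRAPEZY_TOKEN = re.compile(r"\bScrapezyBot/(\d+(?:\.\d+)*)")


def scrapezy_version(user_agent: Optional[str]) -> Optional[str]:
    """Return the ScrapezyBot version found in a User-Agent header, or None."""
    match = SCRAPEZY_TOKEN.search(user_agent or "")
    return match.group(1) if match else None
```

For cryptographic verification, see the HTTP Message Signatures section.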

Bot Behavior & Policies

Rate Limiting

Scrapezy implements responsible crawling practices to minimize server impact:

Request Delay: Minimum 500ms delay between requests to the same domain
Configurable Depth: User-configurable maximum crawl depth (default: 3 levels)
Same-Domain Restriction: Crawls stay within the target domain to prevent unintended scope expansion
Concurrent Limits: Limited concurrent requests per domain
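
The delay, depth, and same-domain rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not Scrapezy's actual crawler: the names `DomainThrottle` and `in_scope` are hypothetical, and "same domain" is interpreted here as an exact hostname match.

```python
import time
from urllib.parse import urlsplit

MIN_DELAY_SECONDS = 0.5  # documented minimum delay between requests to one domain
DEFAULT_MAX_DEPTH = 3    # documented default maximum crawl depth


class DomainThrottle:
    """Tracks the last request time per host and reports how long to wait."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.last_request = {}  # hostname -> time of the most recent request

    def wait_time(self, url):
        """Seconds to wait before requesting `url` (0.0 if it may go out now)."""
        last = self.last_request.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Note that a request to `url` was just sent."""
        self.last_request[urlsplit(url).hostname] = self.clock()


def in_scope(url, start_url, depth, max_depth=DEFAULT_MAX_DEPTH):
    """True if `url` is on the start URL's host and within the depth limit."""
    same_host = urlsplit(url).hostname == urlsplit(start_url).hostname
    return same_host and depth <= max_depth
```

A crawler loop would call `wait_time` before each request, sleep for the returned duration, send the request, and then call `record`.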

robots.txt Compliance

Scrapezy respects robots.txt directives:

The bot parses and honors robots.txt files
Respects Disallow directives for the ScrapezyBot user-agent
Respects Crawl-delay directives when specified
Falls back to * (wildcard) rules when robots.txt defines no rules specifically for ScrapezyBot
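
Site owners who want to preview how such rules resolve can approximate the behavior with Python's standard `urllib.robotparser`. This is only a sketch: Scrapezy's own parser may differ in edge cases, and the helper names and sample robots.txt contents are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def _parse(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def scrapezy_allowed(robots_txt: str, url: str) -> bool:
    """True if the rules permit the ScrapezyBot user-agent to fetch `url`."""
    return _parse(robots_txt).can_fetch("ScrapezyBot", url)


def scrapezy_crawl_delay(robots_txt: str):
    """Crawl-delay that applies to ScrapezyBot, or None if unspecified."""
    return _parse(robots_txt).crawl_delay("ScrapezyBot")


# Hypothetical file with a ScrapezyBot-specific group and a wildcard group.
SPECIFIC_RULES = """\
User-agent: ScrapezyBot
Disallow: /private/
Crawl-delay: 2

User-agent: *
Disallow: /
"""

# Hypothetical file with only a wildcard group, which ScrapezyBot falls back to.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /admin/
"""
```

With `SPECIFIC_RULES`, `/products` is allowed and `/private/report` is not, because the ScrapezyBot group takes precedence over the wildcard group. With `WILDCARD_ONLY`, the wildcard rules apply.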

HTTP Message Signatures (Cloudflare Verified Bot)

Scrapezy participates in Cloudflare's Verified Bot program using HTTP Message Signatures:

Public Key Location: https://scrapezy.com/.well-known/http-message-signatures-directory
Signature Algorithm: ES256 (ECDSA with P-256 and SHA-256)
Signed Components: Includes request path, host, and timestamp
Key Rotation: Public keys are rotated periodically for security

This allows website owners to cryptographically verify that requests are genuinely from Scrapezy.
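
To give a sense of what is signed, the sketch below builds an RFC 9421 signature base, the string over which the signature is computed. It rests on assumptions: "request path, host, and timestamp" is mapped here to the `@path` and `@authority` derived components plus the `created` parameter, and the `keyid` value is a placeholder. In practice a verifier reads the covered components and parameters from the request's `Signature-Input` header, then checks the signature over this string with the published public key using a cryptography library (not shown).

```python
def signature_base(authority: str, path: str, created: int, keyid: str) -> str:
    """Build an RFC 9421 signature base covering @authority and @path.

    Assumption: the covered components and parameters chosen here are
    illustrative; real values come from the Signature-Input header.
    """
    components = [("@authority", authority), ("@path", path)]

    # Inner list of covered component names, followed by signature parameters.
    covered = " ".join(f'"{name}"' for name, _ in components)
    params = f'({covered});created={created};keyid="{keyid}"'

    # One line per covered component, then the @signature-params line.
    lines = [f'"{name}": {value}' for name, value in components]
    lines.append(f'"@signature-params": {params}')

    # Lines are joined with a single newline and there is no trailing newline.
    return "\n".join(lines)
```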

Responsible Use Policy

Permitted Uses

Extracting publicly available data for legitimate business purposes
Price monitoring and competitive intelligence
Research and academic data collection
Data aggregation with proper attribution
Compliance monitoring and regulatory reporting

Prohibited Uses

Users of Scrapezy are prohibited from:

Credential stuffing or authentication bypass attempts
Excessive scraping that degrades server performance
Scraping private or authenticated data without authorization
Violating website terms of service or legal restrictions
Malicious scanning or vulnerability probing
Copyright infringement or unauthorized content redistribution

Technical Capabilities

Scraping Methods

Single URL Extraction: Direct data extraction from individual pages
URL List Processing: Batch processing of provided URL lists
Sitemap Crawling: Automated discovery via XML sitemaps
Base Page Crawling: Pattern-based crawling from a starting URL
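
As an illustration of sitemap-based discovery, the following sketch extracts page URLs from a standard XML sitemap using only the Python standard library. It is not Scrapezy's implementation: the function name `sitemap_urls` is hypothetical, and sitemap index files (which list other sitemaps) are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list:
    """Return the <loc> URLs listed in a <urlset> sitemap document."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls
```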

Data Storage

Extracted data is stored securely in our PostgreSQL database
Users control data retention and can delete data at any time
Data can be exported to JSON, CSV, or Google Sheets
Datasets can optionally be shared in our marketplace
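
For readers post-processing exported data themselves, the sketch below shows one way a list of extracted records could be serialized to JSON and CSV. The record shape and the function names are assumptions for illustration; the actual export format produced by Scrapezy is not specified here.

```python
import csv
import io
import json


def to_json(records: list) -> str:
    """Serialize records (a list of dicts) as pretty-printed JSON."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records: list) -> str:
    """Serialize records as CSV, using the union of all keys as columns."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()
```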

Scheduling

Cron-based scheduling for recurring scrapes
Configurable intervals from minutes to months
Automatic retry logic for failed requests
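
Retry logic of this kind is often implemented with exponential backoff. The sketch below shows that general pattern; it is an assumption that Scrapezy uses anything similar, and the function name, attempt count, and delays are illustrative only.

```python
import time


def retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds or `max_attempts` is exhausted.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on
    between attempts, and re-raises the last error if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            if attempt < max_attempts - 1:  # no wait after the final attempt
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.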

Security & Privacy

Data Handling

All data transmission uses HTTPS encryption
API authentication via secure API keys
User data is isolated and access-controlled
Compliance with GDPR and privacy regulations

Vulnerability Management

Regular security audits and penetration testing
Prompt patching of identified vulnerabilities
Responsible disclosure program for security researchers
Monitoring and logging for abuse detection

Contact & Support

Bot-Related Issues

If you have concerns about Scrapezy's bot behavior or need to report abuse:

Email: [email protected]
Subject: "Bot Issue - [Your Domain]"
Include: Domain affected, timestamp, and description of issue

Blocking Scrapezy

To block Scrapezy from accessing your website, add the following to your robots.txt:

User-agent: ScrapezyBot
Disallow: /

Cloudflare customers can also configure bot management rules for verified bots in the Cloudflare dashboard.

Reporting Abuse

If you believe Scrapezy is being used to violate your website's terms of service or legal rights:

Email [email protected] with:
- Your website domain
- Timestamp of the activity
- Description of the violation
- Any relevant logs or evidence

We will investigate within 24-48 hours and take appropriate action, which may include:
- Suspending the user's account
- Blocking access to your domain
- Cooperating with legal authorities if necessary

API Access

Users can also access Scrapezy programmatically via our REST API; see the API documentation for details.

Compliance & Certifications

Cloudflare Verified Bot: Participating in the Verified Bot program
HTTP Message Signatures: RFC 9421-compliant signatures
Data Protection: GDPR-compliant data handling
Security: SOC 2 Type II compliance (in progress)

Updates & Changes

This documentation is maintained to reflect current bot behavior. Material changes to bot behavior will be:

Reflected in this documentation
Announced via our changelog at scrapezy.com/changelog
Communicated to active users via email when significant

Last Updated: October 2, 2025
