Scrapezy Bot Documentation
Technical documentation for the Scrapezy web scraping bot
Overview
Scrapezy is a legitimate web scraping service that helps businesses extract structured data from websites for analytics, monitoring, research, and data aggregation purposes. Our AI-powered platform enables users to extract data according to custom schemas without writing code.
Service Purpose
Scrapezy provides:
- Structured Data Extraction: AI-powered extraction of data from web pages according to user-defined schemas
- Business Intelligence: Price monitoring, competitor analysis, and market research data collection
- Data Aggregation: Automated collection and organization of publicly available web data
- Research & Analytics: Academic and commercial research data gathering
User-Agent Identification
The Scrapezy bot identifies itself using the following user-agent string:
ScrapezyBot/1.0 (+https://scrapezy.com/docs/reference/bot)
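Site operators who want to spot this user-agent in request logs could match on the `ScrapezyBot` product token. The following is a minimal sketch, not part of Scrapezy's tooling: the regular expression and the `scrapezy_version` helper are assumptions made for illustration, and only the user-agent string itself comes from this page.

```python
import re
from typing import Optional

# Product token taken from the documented user-agent string.
# The pattern itself is an illustrative assumption.
SCRAPEZY_UA = re.compile(r"\bScrapezyBot/(\d+(?:\.\d+)*)")


def scrapezy_version(user_agent: Optional[str]) -> Optional[str]:
    """Return the ScrapezyBot version found in a User-Agent header, or None."""
    match = SCRAPEZY_UA.search(user_agent or "")
    return match.group(1) if match else None


documented = "ScrapezyBot/1.0 (+https://scrapezy.com/docs/reference/bot)"
print(scrapezy_version(documented))
print(scrapezy_version("Mozilla/5.0 (X11; Linux x86_64)"))
```

Any client can set an arbitrary User-Agent header, so a match on its own does not prove that a request came from Scrapezy.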
Bot Behavior & Policies
Rate Limiting
Scrapezy implements responsible crawling practices to minimize server impact:
- Request Delay: Minimum 500ms delay between requests to the same domain
- Configurable Depth: User-configurable maximum crawl depth (default: 3 levels)
- Same-Domain Restriction: Crawls stay within the target domain to prevent unintended scope expansion
- Concurrency Limits: The number of concurrent requests per domain is capped
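The limits above can be pictured with a small sketch. It is not Scrapezy's implementation: the `DomainThrottle` class and `in_scope` helper are hypothetical names, "domain" is approximated by the URL hostname, and the start page is assumed to count as depth 0. Only the 500ms delay and the default depth of 3 come from this page.

```python
import time
from urllib.parse import urlsplit

MIN_DELAY_S = 0.5  # documented minimum delay between requests to one domain
MAX_DEPTH = 3      # documented default maximum crawl depth


class DomainThrottle:
    """Tracks the last request time per hostname and reports how long to wait."""

    def __init__(self, min_delay=MIN_DELAY_S, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self._last = {}  # hostname -> time of the most recent request

    def wait_time(self, url):
        """Seconds to wait before `url` may be fetched (0.0 if allowed now)."""
        last = self._last.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Mark `url` as fetched at the current time."""
        self._last[urlsplit(url).hostname] = self.clock()


def in_scope(url, start_url, depth, max_depth=MAX_DEPTH):
    """True if `url` shares the start URL's hostname and is within the depth limit."""
    same_host = urlsplit(url).hostname == urlsplit(start_url).hostname
    return same_host and depth <= max_depth
```

A crawler loop would call `wait_time` before each fetch, sleep for the returned duration, then call `record`; links for which `in_scope` returns false would be dropped instead of queued.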
robots.txt Compliance
Scrapezy respects robots.txt directives:
- The bot parses and honors robots.txt files
- Respects Disallow directives for the ScrapezyBot user-agent
- Respects Crawl-delay directives when specified
- Falls back to * (wildcard) rules when no specific rules are defined
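To see how these rules interact, the sketch below evaluates two sample robots.txt files with Python's standard `urllib.robotparser`. This only illustrates the semantics described above; Scrapezy's own parser is not documented here, and the sample rules and `example.com` URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample file with a ScrapezyBot-specific group and a wildcard group.
SPECIFIC_RULES = """\
User-agent: ScrapezyBot
Disallow: /private/
Crawl-delay: 2

User-agent: *
Disallow: /
"""

# Sample file with only a wildcard group, which ScrapezyBot falls back to.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""


def load_rules(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(SPECIFIC_RULES)
print(rules.can_fetch("ScrapezyBot", "https://example.com/private/report"))
print(rules.can_fetch("ScrapezyBot", "https://example.com/products"))
print(rules.crawl_delay("ScrapezyBot"))

fallback = load_rules(WILDCARD_ONLY)
print(fallback.can_fetch("ScrapezyBot", "https://example.com/admin/users"))
print(fallback.crawl_delay("ScrapezyBot"))
```

In the first file the ScrapezyBot group takes precedence, so the blanket `Disallow: /` in the wildcard group does not apply to it; in the second file the wildcard group is the only one available and governs both the path rules and the crawl delay.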
HTTP Message Signatures (Cloudflare Verified Bot)
Scrapezy participates in Cloudflare's Verified Bot program using HTTP Message Signatures:
- Public Key Location:
https://scrapezy.com/.well-known/http-message-signatures-directory
- Signature Algorithm: ES256 (ECDSA with P-256 and SHA-256)
- Signed Components: Includes request path, host, and timestamp
- Key Rotation: Public keys are rotated periodically for security
This allows website owners to cryptographically verify that requests are genuinely from Scrapezy.
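For readers unfamiliar with HTTP Message Signatures, the sketch below shows how a verifier might rebuild the string that gets signed (the "signature base" in RFC 9421). It rests on several assumptions: the mapping of "request path, host, and timestamp" to the `@path` and `@authority` components and the `created` parameter is a guess, `example-key` is a hypothetical key ID, and `build_signature_base` is not a Scrapezy or Cloudflare API.

```python
def build_signature_base(components, created, keyid):
    """Build an RFC 9421 signature base and its signature parameters.

    `components` is an ordered list of (identifier, value) pairs,
    for example ("@authority", "example.com").
    """
    # Inner list of covered component identifiers, each double-quoted.
    covered = " ".join(f'"{name}"' for name, _ in components)
    params = f'({covered});created={created};keyid="{keyid}"'

    # One line per covered component, then the @signature-params line.
    lines = [f'"{name}": {value}' for name, value in components]
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines), params


# Assumed component choice and hypothetical key ID, for illustration only.
base, params = build_signature_base(
    [("@authority", "example.com"), ("@path", "/products")],
    created=1700000000,
    keyid="example-key",
)
print(base)
```

A verifier would then check the request's signature over the ASCII bytes of `base` using the public key published in the directory above. That step needs a cryptography library and is not shown here.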
Responsible Use Policy
Permitted Uses
- Extracting publicly available data for legitimate business purposes
- Price monitoring and competitive intelligence
- Research and academic data collection
- Data aggregation with proper attribution
- Compliance monitoring and regulatory reporting
Prohibited Uses
Users of Scrapezy are prohibited from:
- Credential stuffing or authentication bypass attempts
- Excessive scraping that degrades server performance
- Scraping private or authenticated data without authorization
- Violating website terms of service or legal restrictions
- Malicious scanning or vulnerability probing
- Copyright infringement or unauthorized content redistribution
Technical Capabilities
Scraping Methods
- Single URL Extraction: Direct data extraction from individual pages
- URL List Processing: Batch processing of provided URL lists
- Sitemap Crawling: Automated discovery via XML sitemaps
- Base Page Crawling: Pattern-based crawling from a starting URL
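As an illustration of sitemap-based discovery, the sketch below pulls page URLs out of a standard XML sitemap using only the Python standard library. It is a generic example of the technique, not Scrapezy's crawler; the `sitemap_urls` helper and the sample sitemap are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products</loc></url>
</urlset>
"""


def sitemap_urls(xml_text: str) -> List[str]:
    """Return the <loc> values of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


print(sitemap_urls(SAMPLE_SITEMAP))
```

Sitemap index files, which list other sitemaps under `<sitemap>` elements, would need an extra pass that this sketch leaves out.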
Data Storage
- Extracted data is stored securely in our PostgreSQL database
- Users control data retention and can delete data at any time
- Data can be exported to JSON, CSV, or Google Sheets
- Datasets can be optionally shared in our marketplace
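For users post-processing exported data, the sketch below shows one way a list of extracted records could be serialized to JSON and CSV. The record fields and the `to_json` and `to_csv` helpers are hypothetical; the actual export format produced by Scrapezy is not specified on this page.

```python
import csv
import io
import json


def to_json(records):
    """Serialize records (a list of dicts) as pretty-printed JSON."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize records as CSV; columns follow first appearance of each key."""
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


# Hypothetical extracted records, for illustration only.
records = [
    {"name": "Widget", "price": 9.99},
    {"name": "Gadget", "price": 19.5, "sku": "G-1"},
]
print(to_csv(records))
```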
Scheduling
- Cron-based scheduling for recurring scrapes
- Configurable intervals from minutes to months
- Automatic retry logic for failed requests
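The retry behavior is not specified in detail here, so the following is only a sketch of one common approach, retrying with exponential backoff. The `fetch_with_retry` helper, the attempt count, and the delays are all assumptions and do not describe Scrapezy's actual policy.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch(url)`, retrying on failure with exponentially growing delays.

    The last failure is re-raised once `max_attempts` is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Here `fetch` is any callable that takes a URL and raises on failure; passing `sleep` as a parameter keeps the function easy to test without real waiting.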
Security & Privacy
Data Handling
- All data transmission uses HTTPS encryption
- API authentication via secure API keys
- User data is isolated and access-controlled
- Compliance with GDPR and privacy regulations
Vulnerability Management
- Regular security audits and penetration testing
- Prompt patching of identified vulnerabilities
- Responsible disclosure program for security researchers
- Monitoring and logging for abuse detection
Contact & Support
Bot-Related Issues
If you have concerns about Scrapezy's bot behavior or need to report abuse:
- Email: [email protected]
- Subject: "Bot Issue - [Your Domain]"
- Include: Domain affected, timestamp, and description of issue
Blocking Scrapezy
To block Scrapezy from accessing your website, add the following to your robots.txt:
User-agent: ScrapezyBot
Disallow: /
Cloudflare customers using Verified Bots can also configure bot management rules in the Cloudflare dashboard.
Reporting Abuse
If you believe Scrapezy is being used to violate your website's terms of service or legal rights:
- Email [email protected] with:
  - Your website domain
  - Timestamp of the activity
  - Description of the violation
  - Any relevant logs or evidence
- We will investigate within 24-48 hours and take appropriate action, which may include:
  - Suspending the user's account
  - Blocking access to your domain
  - Cooperating with legal authorities if necessary
API Access
Users can also access Scrapezy programmatically via our REST API. For API documentation, see:
Compliance & Certifications
- Cloudflare Verified Bot: Participating in the Verified Bot program
- HTTP Message Signatures: RFC 9421 compliant signatures
- Data Protection: GDPR-compliant data handling
- Security: SOC 2 Type II compliance (in progress)
Updates & Changes
This documentation is maintained to reflect current bot behavior. Material changes to bot behavior will be:
- Reflected in this documentation
- Announced via our changelog at scrapezy.com/changelog
- Communicated to active users via email when significant
Last Updated: October 2, 2025