Scrapezy Bot Documentation

Technical documentation for the Scrapezy web scraping bot


Overview

Scrapezy is a legitimate web scraping service that helps businesses extract structured data from websites for analytics, monitoring, research, and data aggregation. Our AI-powered platform lets users extract data according to custom schemas without writing code.

Service Purpose

Scrapezy provides:

  • Structured Data Extraction: AI-powered extraction of data from web pages according to user-defined schemas
  • Business Intelligence: Price monitoring, competitor analysis, and market research data collection
  • Data Aggregation: Automated collection and organization of publicly available web data
  • Research & Analytics: Academic and commercial research data gathering

User-Agent Identification

The Scrapezy bot identifies itself using the following user-agent string:

ScrapezyBot/1.0 (+https://scrapezy.com/docs/reference/bot)
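
As an illustration for site operators, the following minimal Python sketch recognizes the user-agent string above in a request header or log line. The function name scrapezy_version and the regular expression are assumptions made for this example and are not part of any Scrapezy tooling.

```python
import re
from typing import Optional

# Matches the product token from the documented user-agent string.
# Accepting any dotted version number is an assumption of this sketch.
SCRAPEZY_UA = re.compile(r"\bScrapezyBot/(\d+(?:\.\d+)*)")


def scrapezy_version(user_agent: str) -> Optional[str]:
    """Return the ScrapezyBot version found in a User-Agent value, or None."""
    match = SCRAPEZY_UA.search(user_agent or "")
    return match.group(1) if match else None


if __name__ == "__main__":
    header = "ScrapezyBot/1.0 (+https://scrapezy.com/docs/reference/bot)"
    print(scrapezy_version(header))
```

A user-agent match alone can be spoofed, so treat it as a hint rather than proof of origin.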

Bot Behavior & Policies

Rate Limiting

Scrapezy implements responsible crawling practices to minimize server impact:

  • Request Delay: Minimum 500ms delay between requests to the same domain
  • Configurable Depth: User-configurable maximum crawl depth (default: 3 levels)
  • Same-Domain Restriction: Crawls stay within the target domain to prevent unintended scope expansion
  • Concurrency Limits: The number of concurrent requests per domain is capped
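
The delay, depth, and same-domain rules above can be sketched in Python roughly as follows. This is not Scrapezy's implementation: the PoliteScheduler class, its method names, and the convention that the starting page has depth 0 are all assumptions made for illustration. Only the 500ms delay and the default depth of 3 come from this page.

```python
import time
from urllib.parse import urlsplit

MIN_DELAY_SECONDS = 0.5  # documented minimum delay per domain
DEFAULT_MAX_DEPTH = 3    # documented default crawl depth


class PoliteScheduler:
    """Hypothetical helper that decides whether and when a URL may be fetched."""

    def __init__(self, start_url, min_delay=MIN_DELAY_SECONDS,
                 max_depth=DEFAULT_MAX_DEPTH, clock=time.monotonic):
        self.domain = urlsplit(start_url).hostname
        self.min_delay = min_delay
        self.max_depth = max_depth
        self.clock = clock
        self.last_request = {}  # hostname -> time of the most recent request

    def in_scope(self, url, depth):
        """Apply the same-domain restriction and the depth limit."""
        return urlsplit(url).hostname == self.domain and depth <= self.max_depth

    def wait_time(self, url):
        """Seconds to wait before the next request to this URL's domain."""
        last = self.last_request.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_request(self, url):
        """Remember when the domain of this URL was last requested."""
        self.last_request[urlsplit(url).hostname] = self.clock()
```

A crawler loop built on this sketch would check in_scope before queueing a link, sleep for wait_time before each fetch, and call record_request afterwards.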

robots.txt Compliance

Scrapezy respects robots.txt directives:

  • Parses and honors robots.txt files
  • Respects Disallow directives for the ScrapezyBot user-agent
  • Respects Crawl-delay directives when specified
  • Falls back to * (wildcard) rules when no ScrapezyBot-specific rules are defined
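
Site owners who want to preview how rules like these are evaluated can use Python's standard urllib.robotparser module, as in the sketch below. Python's parser only approximates the behavior described here and may differ from Scrapezy's own parser in edge cases. The sample robots.txt contents and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample file with a ScrapezyBot-specific group and a wildcard group.
ROBOTS_SPECIFIC = """\
User-agent: ScrapezyBot
Crawl-delay: 2
Disallow: /private/

User-agent: *
Disallow: /
"""

# Sample file with only a wildcard group, which ScrapezyBot falls back to.
ROBOTS_WILDCARD = """\
User-agent: *
Disallow: /admin/
"""


def parse_robots(text):
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    specific = parse_robots(ROBOTS_SPECIFIC)
    print(specific.can_fetch("ScrapezyBot", "https://example.com/private/x"))
    print(specific.crawl_delay("ScrapezyBot"))

    wildcard = parse_robots(ROBOTS_WILDCARD)
    print(wildcard.can_fetch("ScrapezyBot", "https://example.com/admin/x"))
```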

HTTP Message Signatures (Cloudflare Verified Bot)

Scrapezy participates in Cloudflare's Verified Bot program using HTTP Message Signatures:

  • Public Key Location: https://scrapezy.com/.well-known/http-message-signatures-directory
  • Signature Algorithm: ES256 (ECDSA with P-256 and SHA-256)
  • Signed Components: Includes request path, host, and timestamp
  • Key Rotation: Public keys are rotated periodically for security

This allows website owners to cryptographically verify that requests are genuinely from Scrapezy.
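
To show what a verifier actually checks, the sketch below builds an RFC 9421 signature base, the string over which the signature is computed. It assumes that the request path, host, and timestamp listed above correspond to the @path and @authority derived components and the created parameter. That mapping, the function name build_signature_base, and the key identifier are illustrative assumptions, and real Scrapezy requests may cover additional components or parameters.

```python
def build_signature_base(authority: str, path: str, created: int, keyid: str) -> str:
    """Build a simplified RFC 9421 signature base for two derived components."""
    components = [
        ("@authority", authority.lower()),  # host, normalized to lowercase
        ("@path", path or "/"),             # an empty path is signed as "/"
    ]
    covered = " ".join(f'"{name}"' for name, _ in components)
    # Signature parameters: the covered component list plus its metadata.
    params = f'({covered});created={created};keyid="{keyid}"'
    lines = [f'"{name}": {value}' for name, value in components]
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines)


if __name__ == "__main__":
    # "example-key" is a placeholder, not a real Scrapezy key identifier.
    print(build_signature_base("example.com", "/products", 1759363200, "example-key"))
```

A verifier would then check the ECDSA P-256 signature from the request against this string using the public key published in the key directory, which requires a cryptography library outside the Python standard library.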

Responsible Use Policy

Permitted Uses

  • Extracting publicly available data for legitimate business purposes
  • Price monitoring and competitive intelligence
  • Research and academic data collection
  • Data aggregation with proper attribution
  • Compliance monitoring and regulatory reporting

Prohibited Uses

Users of Scrapezy are prohibited from:

  • Credential stuffing or authentication bypass attempts
  • Excessive scraping that degrades server performance
  • Scraping private or authenticated data without authorization
  • Violating website terms of service or legal restrictions
  • Malicious scanning or vulnerability probing
  • Copyright infringement or unauthorized content redistribution

Technical Capabilities

Scraping Methods

  1. Single URL Extraction: Direct data extraction from individual pages
  2. URL List Processing: Batch processing of provided URL lists
  3. Sitemap Crawling: Automated discovery via XML sitemaps
  4. Base Page Crawling: Pattern-based crawling from a starting URL
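
As a rough illustration of sitemap-based discovery (method 3), the Python sketch below extracts page URLs from a standard XML sitemap. It is not Scrapezy's crawler: the function name urls_from_sitemap and the sample document are assumptions, and sitemap index files that point to other sitemaps are not handled.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/pricing </loc></url>
</urlset>"""


def urls_from_sitemap(xml_text: str) -> List[str]:
    """Return the <loc> values of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE_SITEMAP):
        print(url)
```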

Data Storage

  • Extracted data is stored securely in our PostgreSQL database
  • Users control data retention and can delete data at any time
  • Data can be exported to JSON, CSV, or Google Sheets
  • Datasets can optionally be shared in our marketplace
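
To make the JSON and CSV export formats concrete, the sketch below serializes the same records both ways using Python's standard library. The record fields (name, price, sku) and the helper names are hypothetical. Scrapezy's actual export layout is not specified on this page and may differ.

```python
import csv
import io
import json


def to_json(records):
    """Serialize a list of flat dictionaries as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize a list of flat dictionaries as CSV, one column per key seen."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [
        {"name": "Widget", "price": 9.99},
        {"name": "Gadget", "price": 19.5, "sku": "G-1"},
    ]
    print(to_json(rows))
    print(to_csv(rows))
```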

Scheduling

  • Cron-based scheduling for recurring scrapes
  • Configurable intervals from minutes to months
  • Automatic retry logic for failed requests
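
This page does not specify the retry policy. As one common approach, the Python sketch below retries a failing operation with exponential backoff. The attempt count, base delay, and function name fetch_with_retry are assumptions for illustration and do not describe Scrapezy's actual behavior.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out.

    Waits base_delay, then 2 * base_delay, and so on between attempts,
    and re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing the sleep function as a parameter keeps the sketch easy to test without real delays.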

Security & Privacy

Data Handling

  • All data transmission uses HTTPS encryption
  • API authentication via secure API keys
  • User data is isolated and access-controlled
  • Compliance with GDPR and privacy regulations

Vulnerability Management

  • Regular security audits and penetration testing
  • Prompt patching of identified vulnerabilities
  • Responsible disclosure program for security researchers
  • Monitoring and logging for abuse detection

Contact & Support

Bot-Related Issues

If you have concerns about Scrapezy's bot behavior or need to report abuse:

  • Email: [email protected]
  • Subject: "Bot Issue - [Your Domain]"
  • Include: Domain affected, timestamp, and description of issue

Blocking Scrapezy

To block Scrapezy from accessing your website, add to your robots.txt:

User-agent: ScrapezyBot
Disallow: /

Cloudflare customers using Verified Bots can also configure bot management rules in the Cloudflare dashboard.

Reporting Abuse

If you believe Scrapezy is being used to violate your website's terms of service or legal rights:

  1. Email [email protected] with:

    • Your website domain
    • Timestamp of the activity
    • Description of the violation
    • Any relevant logs or evidence
  2. We will investigate within 24 to 48 hours and take appropriate action, which may include:

    • Suspending the user's account
    • Blocking access to your domain
    • Cooperating with legal authorities if necessary

API Access

Users can also access Scrapezy programmatically via our REST API; see the API documentation for details.

Compliance & Certifications

  • Cloudflare Verified Bot: Participating in the Verified Bot program
  • HTTP Message Signatures: RFC 9421 compliant signatures
  • Data Protection: GDPR-compliant data handling
  • Security: SOC 2 Type II compliance (in progress)

Updates & Changes

This documentation is maintained to reflect current bot behavior. Material changes to bot behavior will be:

  • Updated in this documentation
  • Announced via our changelog at scrapezy.com/changelog
  • Communicated to active users via email when significant

Last Updated: October 2, 2025


Additional Resources