Schema Validation Guide
Learn how to use schemas to ensure consistent and validated data extraction
Schemas give your extractions a consistent structure and validate the data that comes back. This guide covers field types, nested structures, using schemas in API calls, reading validation results, and common schema patterns.
What Are Schemas?
Schemas define the expected structure and types of data you want to extract. They act as a contract between your extraction requests and the returned data, ensuring consistency and catching errors early.
Key Benefits
- Consistency: Ensure all extractions return data in the same format
- Validation: Catch missing or incorrectly typed data before processing
- Documentation: Field descriptions serve as built-in documentation
- Type Safety: Specify exact data types for reliable processing
Schema Field Types
Scrapezy supports several field types for comprehensive data modeling:
Basic Types
- string: Text data
- number: Numeric values (integers or floats)
- boolean: True/false values
- date: Date values (automatically parsed to ISO format)
Complex Types
- array: Lists of items
- object: Nested objects with sub-fields
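To make the six type names concrete, here is a minimal Python sketch of a client-side check that maps each one to a Python type. It is not Scrapezy's validator: the helper name `matches_type` is hypothetical, and it assumes that `date` values arrive as ISO 8601 strings.

```python
from datetime import datetime


def matches_type(value, type_name):
    """Return True if `value` is plausible for the given schema type name."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "date":
        # Assumption: dates are returned as ISO 8601 strings such as "2024-05-01".
        if not isinstance(value, str):
            return False
        try:
            datetime.fromisoformat(value)
            return True
        except ValueError:
            return False
    if type_name == "array":
        return isinstance(value, list)
    if type_name == "object":
        return isinstance(value, dict)
    raise ValueError(f"unknown schema type: {type_name}")


print(matches_type(2499, "number"))   # True
print(matches_type(True, "number"))   # False
```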
Basic Schema Example
Here's a simple schema for extracting product information:
{
  "name": "Product Schema",
  "fields": [
    {
      "name": "productName",
      "type": "string",
      "required": true,
      "description": "Full product name"
    },
    {
      "name": "price",
      "type": "number",
      "required": true,
      "description": "Price in USD"
    },
    {
      "name": "inStock",
      "type": "boolean",
      "required": false,
      "description": "Whether the product is currently in stock"
    },
    {
      "name": "categories",
      "type": "array",
      "required": false,
      "description": "List of product categories"
    }
  ]
}
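As an illustration of what the `required` flag implies, the following Python sketch lists the required top-level fields that a record lacks. The helper `missing_required` and the sample record are hypothetical, and treating a `None` value as missing is an assumption, not documented API behavior.

```python
# The product schema from above, with descriptions omitted for brevity.
product_schema = {
    "name": "Product Schema",
    "fields": [
        {"name": "productName", "type": "string", "required": True},
        {"name": "price", "type": "number", "required": True},
        {"name": "inStock", "type": "boolean", "required": False},
        {"name": "categories", "type": "array", "required": False},
    ],
}


def missing_required(schema, record):
    """List the names of required top-level fields absent from `record`.

    Assumption: a field whose value is None counts as missing.
    """
    return [
        field["name"]
        for field in schema["fields"]
        if field.get("required", False) and record.get(field["name"]) is None
    ]


record = {"productName": "Desk Lamp", "inStock": True}  # made-up sample data
print(missing_required(product_schema, record))  # ['price']
```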
Complex Schema Structures
Nested Objects
For complex data structures, use nested objects:
{
  "name": "Article Schema",
  "fields": [
    {
      "name": "title",
      "type": "string",
      "required": true,
      "description": "Article title"
    },
    {
      "name": "author",
      "type": "object",
      "required": true,
      "description": "Author information",
      "fields": [
        {
          "name": "name",
          "type": "string",
          "required": true,
          "description": "Author's full name"
        },
        {
          "name": "email",
          "type": "string",
          "required": false,
          "description": "Author's email address"
        },
        {
          "name": "bio",
          "type": "string",
          "required": false,
          "description": "Author biography"
        }
      ]
    },
    {
      "name": "publishDate",
      "type": "date",
      "required": true,
      "description": "Publication date"
    }
  ]
}
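Nested `fields` form a tree, so code that inspects a schema usually walks it recursively. The sketch below, with a hypothetical helper `required_paths`, collects the dotted paths of all required fields in the article schema. It lists a required sub-field even when its parent is optional, a simplification you may want to change.

```python
def required_paths(fields, prefix=""):
    """Yield dotted paths of required fields, descending into nested objects."""
    for field in fields:
        path = f"{prefix}{field['name']}"
        if field.get("required", False):
            yield path
        if field["type"] == "object":
            # Recurse into the sub-fields of a nested object.
            yield from required_paths(field.get("fields", []), prefix=f"{path}.")


# The article schema's fields from above, with descriptions omitted.
article_fields = [
    {"name": "title", "type": "string", "required": True},
    {
        "name": "author",
        "type": "object",
        "required": True,
        "fields": [
            {"name": "name", "type": "string", "required": True},
            {"name": "email", "type": "string", "required": False},
            {"name": "bio", "type": "string", "required": False},
        ],
    },
    {"name": "publishDate", "type": "date", "required": True},
]

print(list(required_paths(article_fields)))
# ['title', 'author', 'author.name', 'publishDate']
```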
Arrays with Object Elements
For arrays containing structured data:
{
  "name": "Product Listing Schema",
  "fields": [
    {
      "name": "products",
      "type": "array",
      "required": true,
      "description": "List of products",
      "items": {
        "type": "object",
        "fields": [
          {
            "name": "name",
            "type": "string",
            "required": true,
            "description": "Product name"
          },
          {
            "name": "price",
            "type": "number",
            "required": true,
            "description": "Product price"
          },
          {
            "name": "rating",
            "type": "number",
            "required": false,
            "description": "Average rating (1-5)"
          }
        ]
      }
    }
  ]
}
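The `items` definition applies to every element of the array. A rough Python sketch of checking each element against it might look like the following. The helper `check_items`, the index-style paths such as `products[1].price`, and the sample values are all illustrative and are not part of the Scrapezy API.

```python
def check_items(array_field, values):
    """Report required sub-fields missing from each element of an array."""
    item_fields = array_field["items"]["fields"]
    problems = []
    for index, element in enumerate(values):
        for sub in item_fields:
            if sub.get("required", False) and sub["name"] not in element:
                problems.append(f"{array_field['name']}[{index}].{sub['name']}")
    return problems


# The "products" field from above, with descriptions omitted.
products_field = {
    "name": "products",
    "type": "array",
    "required": True,
    "items": {
        "type": "object",
        "fields": [
            {"name": "name", "type": "string", "required": True},
            {"name": "price", "type": "number", "required": True},
            {"name": "rating", "type": "number", "required": False},
        ],
    },
}

values = [  # made-up sample data
    {"name": "Keyboard", "price": 49.0, "rating": 4.5},
    {"name": "Mouse"},
]
print(check_items(products_field, values))  # ['products[1].price']
```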
Using Schemas in API Calls
Inline Schema Definition
Define schemas directly in your API call:
POST https://scrapezy.com/api/extract
Content-Type: application/json
x-api-key: your_api_key
{
  "url": "https://example.com/products/laptop",
  "prompt": "Extract laptop specifications according to the schema",
  "schema": {
    "name": "Laptop Specifications",
    "fields": [
      {
        "name": "model",
        "type": "string",
        "required": true,
        "description": "Laptop model name"
      },
      {
        "name": "processor",
        "type": "string",
        "required": true,
        "description": "CPU model and specifications"
      },
      {
        "name": "ram",
        "type": "string",
        "required": true,
        "description": "RAM capacity and type"
      },
      {
        "name": "storage",
        "type": "string",
        "required": true,
        "description": "Storage capacity and type"
      },
      {
        "name": "price",
        "type": "number",
        "required": true,
        "description": "Price in USD"
      }
    ]
  }
}
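The same request can be assembled in Python with only the standard library. This is a sketch rather than an official client: the function name `build_extract_request` is hypothetical, the schema is trimmed to two fields for brevity, and the request is built but not sent.

```python
import json
import urllib.request

EXTRACT_URL = "https://scrapezy.com/api/extract"


def build_extract_request(api_key, url, prompt, schema):
    """Build (but do not send) a POST request with an inline schema."""
    body = json.dumps({"url": url, "prompt": prompt, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )


laptop_schema = {
    "name": "Laptop Specifications",
    "fields": [
        {"name": "model", "type": "string", "required": True,
         "description": "Laptop model name"},
        {"name": "price", "type": "number", "required": True,
         "description": "Price in USD"},
    ],
}

request = build_extract_request(
    "your_api_key",
    "https://example.com/products/laptop",
    "Extract laptop specifications according to the schema",
    laptop_schema,
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen(request)` and decode the response body with `json.loads`.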
Using Schema References
Reference pre-existing schemas by ID:
POST https://scrapezy.com/api/extract
Content-Type: application/json
x-api-key: your_api_key
{
  "url": "https://example.com/products/laptop",
  "prompt": "Extract laptop specifications",
  "schemaId": "schema_laptop_specs_v1"
}
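If your code supports both styles, a small helper can build the request body for either one. The sketch below assumes that a request carries exactly one of `schema` or `schemaId`. This guide does not say how the API treats a request containing both, so the check is a defensive choice, and the name `build_payload` is hypothetical.

```python
def build_payload(url, prompt, schema=None, schema_id=None):
    """Return a request body using either an inline schema or a schema ID."""
    # Assumption: exactly one of the two should be supplied.
    if (schema is None) == (schema_id is None):
        raise ValueError("provide exactly one of schema or schema_id")
    payload = {"url": url, "prompt": prompt}
    if schema_id is not None:
        payload["schemaId"] = schema_id
    else:
        payload["schema"] = schema
    return payload


payload = build_payload(
    "https://example.com/products/laptop",
    "Extract laptop specifications",
    schema_id="schema_laptop_specs_v1",
)
print(payload["schemaId"])  # schema_laptop_specs_v1
```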
Schema Validation Results
When using schemas, the API validates the extracted data against your schema definition:
Successful Validation
{
  "jobId": "job_abc123",
  "status": "completed",
  "result": {
    "model": "MacBook Pro 16-inch",
    "processor": "Apple M3 Pro chip",
    "ram": "18GB unified memory",
    "storage": "512GB SSD",
    "price": 2499
  },
  "schemaValidation": {
    "valid": true,
    "errors": []
  }
}
Validation Errors
When required fields are missing from the result, valid is false and each problem is listed in errors. In this example, the required fields storage and price are missing:
{
  "jobId": "job_def456",
  "status": "completed",
  "result": {
    "model": "MacBook Pro 16-inch",
    "processor": "Apple M3 Pro chip",
    "ram": "18GB unified memory"
  },
  "schemaValidation": {
    "valid": false,
    "errors": [
      {
        "field": "storage",
        "message": "Required field is missing"
      },
      {
        "field": "price",
        "message": "Required field is missing"
      }
    ]
  }
}
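On the client side, a response like the one above can be reduced to a mapping from field name to error message. The following Python sketch relies only on the response keys shown in this section, and the helper name `validation_problems` is hypothetical.

```python
def validation_problems(response):
    """Map field name -> message for each schema validation error."""
    validation = response.get("schemaValidation", {})
    if validation.get("valid", False):
        return {}
    return {err["field"]: err["message"] for err in validation.get("errors", [])}


# The failed response from above, as a Python dict.
response = {
    "jobId": "job_def456",
    "status": "completed",
    "result": {
        "model": "MacBook Pro 16-inch",
        "processor": "Apple M3 Pro chip",
        "ram": "18GB unified memory",
    },
    "schemaValidation": {
        "valid": False,
        "errors": [
            {"field": "storage", "message": "Required field is missing"},
            {"field": "price", "message": "Required field is missing"},
        ],
    },
}

for field, message in validation_problems(response).items():
    print(f"{field}: {message}")
```

Note that `status` is `completed` in both examples, so the job status alone does not tell you whether the data passed validation.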
Best Practices
Schema Design
- Start Simple: Begin with basic field types and add complexity as needed
- Required Fields: Only mark fields as required if they're essential
- Clear Descriptions: Use descriptive field names and descriptions
- Consistent Naming: Use consistent naming conventions (e.g., camelCase)
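Some of these design guidelines can be checked mechanically before a schema is used. As a rough sketch, the hypothetical helper `lint_fields` below flags field names that are not camelCase and fields without a description. The regular expression is one possible reading of "camelCase", not a rule taken from the API.

```python
import re

# Starts with a lowercase letter, then letters and digits only.
CAMEL_CASE = re.compile(r"^[a-z][a-zA-Z0-9]*$")


def lint_fields(fields, prefix=""):
    """Return warnings for non-camelCase names and missing descriptions."""
    warnings = []
    for field in fields:
        path = f"{prefix}{field['name']}"
        if not CAMEL_CASE.match(field["name"]):
            warnings.append(f"{path}: name is not camelCase")
        if not field.get("description", "").strip():
            warnings.append(f"{path}: missing description")
        # Descend into nested object fields or array item fields, if any.
        nested = field.get("fields") or field.get("items", {}).get("fields", [])
        warnings.extend(lint_fields(nested, prefix=f"{path}."))
    return warnings


fields = [  # made-up example with two deliberate problems
    {"name": "productName", "type": "string", "description": "Full product name"},
    {"name": "Unit_Price", "type": "number"},
]
for warning in lint_fields(fields):
    print(warning)
# Unit_Price: name is not camelCase
# Unit_Price: missing description
```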
Field Validation
- Type Specificity: Choose the most specific type for each field
- Optional Fields: Mark fields as optional if they might not always be present
- Nested Structure: Use objects for related data to maintain organization
Error Handling
- Check Validation: Always check the schemaValidation field in responses
- Graceful Degradation: Handle missing optional fields gracefully
- Retry Logic: Implement retry logic for validation failures
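The check-and-retry advice above can be sketched as a small wrapper. To keep the example self-contained, the network call is passed in as a function. The names `extract_with_retry`, `max_attempts` and `delay_seconds`, and the linear back-off, are illustrative choices rather than Scrapezy recommendations.

```python
import time


def extract_with_retry(extract, max_attempts=3, delay_seconds=1.0):
    """Call `extract()` until its response validates or attempts run out.

    `extract` is any zero-argument function returning a response dict
    that contains a "schemaValidation" entry.
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        response = extract()
        if response.get("schemaValidation", {}).get("valid", False):
            return response
        if attempt < max_attempts:
            # Wait a little longer after each failed attempt.
            time.sleep(delay_seconds * attempt)
    # Still invalid after the last attempt; the caller decides what to do.
    return response


# Usage with a stand-in for the real API call: fails once, then succeeds.
fake_responses = iter([
    {"schemaValidation": {"valid": False, "errors": [
        {"field": "price", "message": "Required field is missing"}]}},
    {"schemaValidation": {"valid": True, "errors": []}},
])
result = extract_with_retry(lambda: next(fake_responses), delay_seconds=0)
print(result["schemaValidation"]["valid"])  # True
```

Retrying only helps when the failure is transient. If the page never contains a required field, consider marking that field optional instead.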
Common Schema Patterns
E-commerce Product Schema
{
  "name": "E-commerce Product",
  "fields": [
    {
      "name": "name",
      "type": "string",
      "required": true,
      "description": "Product name"
    },
    {
      "name": "price",
      "type": "object",
      "required": true,
      "description": "Price information",
      "fields": [
        {
          "name": "amount",
          "type": "number",
          "required": true,
          "description": "Price amount"
        },
        {
          "name": "currency",
          "type": "string",
          "required": true,
          "description": "Currency code"
        }
      ]
    },
    {
      "name": "availability",
      "type": "boolean",
      "required": false,
      "description": "Product availability"
    },
    {
      "name": "images",
      "type": "array",
      "required": false,
      "description": "Product image URLs"
    }
  ]
}
News Article Schema
{
  "name": "News Article",
  "fields": [
    {
      "name": "headline",
      "type": "string",
      "required": true,
      "description": "Article headline"
    },
    {
      "name": "author",
      "type": "string",
      "required": false,
      "description": "Article author"
    },
    {
      "name": "publishDate",
      "type": "date",
      "required": true,
      "description": "Publication date"
    },
    {
      "name": "content",
      "type": "string",
      "required": true,
      "description": "Article content"
    },
    {
      "name": "tags",
      "type": "array",
      "required": false,
      "description": "Article tags"
    }
  ]
}
Next Steps
- Basic Usage Guide - Learn fundamental extraction techniques
- Advanced Usage Guide - Explore complex extraction patterns
- API Reference - Complete API documentation