Introduction to ScrapixData
Welcome to the ScrapixData developer hub. Our unified API lets you extract data, bypass anti-bot protection, and render JavaScript with a single HTTP request, with no proxy infrastructure to maintain.
Authentication
All API requests require a valid API key passed via the `Authorization: Bearer` header. Your API key determines which geographic proxy pools you can access and sets your concurrency limits.
```http
Authorization: Bearer sk_live_YOUR_API_KEY_HERE
Content-Type: application/json
```
POST /v1/scrape
The primary endpoint for initiating a scraping job. It runs synchronously by default for lightweight pages, and automatically upgrades to a cloud browser instance when JS rendering or ASP is requested.
| Parameter | Type | Description |
|---|---|---|
| url | string (required) | The target URL to scrape. Must include http:// or https:// |
| render_js | boolean | If true, spins up a headless Chrome instance to render the DOM. |
| advanced_stealth_protection | boolean | Advanced Stealth Protection (ASP). Automatically bypasses anti-bot systems such as Cloudflare Turnstile and DataDome. |
| extract_prompt | string | A natural-language instruction; matching page content is returned as structured JSON. |
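Only `url` is required; the boolean parameters default to off. The smallest valid payload is simply (URL shown is a placeholder):

```json
{ "url": "https://example.com/page" }
```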
Example: Extraction via cURL
Here is a complete request that enables JS rendering and ASP, routes traffic through a UK residential proxy, and extracts structured records from a protected page.
```bash
curl -X POST "https://api.scrapixdata.io/v1/scrape" \
  -H "Authorization: Bearer sk_live_xxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.target-enterprise.com/data",
    "render_js": true,
    "advanced_stealth_protection": true,
    "proxy": {
      "type": "residential",
      "country": "GB",
      "city": "London",
      "session_id": "uk_crawler_001"
    },
    "actions": [
      { "action": "wait_for_selector", "selector": ".data-grid" },
      { "action": "scroll_bottom" },
      { "action": "wait", "milliseconds": 2000 }
    ],
    "extract_rules": {
      "records": {
        "selector": ".data-row",
        "output": {
          "id": "@data-id",
          "value": ".price | number",
          "status": ".status-badge | text"
        }
      }
    }
  }'
```
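The `extract_rules` values above pair a CSS selector with an output filter (`text`, `number`) separated by `|`, while `@attr` reads an attribute. As a rough local sketch of how such rule strings could be interpreted (the split and the coercions are assumptions for illustration, not the engine's documented internals):

```javascript
// Split a "selector | filter" rule string; filter defaults to 'text'.
function parseRule(rule) {
  const [selector, filter = 'text'] = rule.split('|').map((s) => s.trim());
  return { selector, filter };
}

// Coerce the matched element's text according to the filter name.
function applyFilter(filter, rawText) {
  switch (filter) {
    case 'text':
      return rawText.trim();
    case 'number':
      // Strip currency symbols and thousands separators before parsing
      return parseFloat(rawText.replace(/[^0-9.-]/g, ''));
    default:
      throw new Error(`Unknown filter: ${filter}`);
  }
}
```

`parseRule('.price | number')` yields `{ selector: '.price', filter: 'number' }`; attribute rules like `@data-id` would need a separate branch.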
Example: Node.js SDK
```javascript
import { ScrapixClient, ExtractionEngine } from 'scrapix-node';

const client = new ScrapixClient({
  apiKey: 'sk_live_xxxxxxxxxxx',
  maxRetries: 3,
  timeout: 45000 // 45s for heavy DOM renders
});

// Submit an asynchronous extraction job; results arrive via the webhook
const job = await client.scrape({
  url: 'https://financial-data-platform.com/metrics',
  method: 'GET',
  stealth: {
    engine: 'chromium-120-patched',
    headers: { 'X-Requested-With': 'XMLHttpRequest' },
    solve_captchas: true
  },
  proxy: {
    pool: 'residential',
    targeting: { asn: 7922, country: 'US' },
    session_id: 'fin_runner_22'
  },
  extract: {
    schema: ExtractionEngine.VisualLLM,
    model: 'scrapix-70b-vision',
    prompt: 'Extract the entire P&L statement as a strictly typed JSON array of objects'
  },
  webhook: {
    url: 'https://api.your-company.com/webhooks/scrapix',
    secret: 'whsec_XXXXXX'
  }
});

console.log(`Job queued via Webhook Pipeline. ID: ${job.id}`);
```
POST /v1/batch
Submit up to 10,000 URLs in a single payload. The batch engine distributes load across our pool of 50M+ rotating residential IPs, automatically mitigating ASN-level bans and managing concurrency windows.
```javascript
import { ScrapixBatchQueue } from 'scrapix-node';

const batch = new ScrapixBatchQueue('sk_live_xxxxxxxxxxx');

await batch.push([
  { url: 'https://competitor.com/category-A', proxy: 'US' },
  { url: 'https://competitor.com/category-B', proxy: 'GB' }
], {
  concurrency_limit: 500,
  retry_policy: 'exponential_backoff',
  stealth_profile: 'aggressive_human_emulation'
});
```
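The `exponential_backoff` retry policy doubles the wait between attempts. A local sketch of that schedule (the 1s base and 30s cap are illustrative assumptions, not documented defaults):

```javascript
// Delay doubles with each attempt: 1s, 2s, 4s, ... capped at 30s.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Retry an async operation, sleeping per the backoff schedule between tries.
async function withRetries(fn, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}
```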
POST /v1/extract/llm
Utilize the `scrapix-70b-vision` unstructured DOM parser to avoid writing fragile CSS selectors. You pass the URL and a lightweight field schema, and the Vision Engine translates the raw page into strictly typed data.
```json
{
  "url": "https://complex-spas.com/dynamic-data",
  "render_js": true,
  "schema": {
    "product_name": "String",
    "historical_prices": "Array",
    "is_in_stock": "Boolean"
  },
  "instructions": "Locate the primary item. If stock says 'Backordered', map to false."
}
```
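Because LLM extraction is non-deterministic, it can be worth validating the returned object against your schema client-side. A minimal sketch for the simplified `"field": "Type"` notation above (a local helper, not part of the ScrapixData SDK):

```javascript
// Check that every field in the schema exists on `obj` with the right type.
function matchesSchema(obj, schema) {
  return Object.entries(schema).every(([field, type]) => {
    const value = obj[field];
    switch (type) {
      case 'String':  return typeof value === 'string';
      case 'Boolean': return typeof value === 'boolean';
      case 'Array':   return Array.isArray(value);
      default:        return false; // unknown type name
    }
  });
}
```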
Webhooks Setup & Security
For operations exceeding 10s (such as JS rendering or automated scrolling), synchronous HTTP requests will time out. Subscribe to our Webhook dispatches to receive data asynchronously as soon as jobs complete.
```javascript
import express from 'express';
import crypto from 'node:crypto';

const app = express();

app.post('/scrapix/webhook', express.raw({ type: 'application/json' }), (req, res) => {
  const signature = req.headers['x-scrapix-signature'] ?? '';
  // Assumes signatures are a hex-encoded HMAC-SHA256 of the raw body;
  // recompute it and compare in constant time
  const expected = crypto.createHmac('sha256', process.env.WH_SECRET).update(req.body).digest('hex');
  if (signature.length === expected.length &&
      crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(expected))) {
    const { job_id, scraped_json } = JSON.parse(req.body);
    database.saveExtraction(job_id, scraped_json);
    res.sendStatus(200);
  } else {
    res.status(401).send('Invalid signature');
  }
});
```
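To exercise a handler locally you can generate a matching signature yourself. This sketch assumes the signature is a hex-encoded HMAC-SHA256 of the raw body, a common webhook scheme; confirm the exact algorithm for your account:

```javascript
import crypto from 'node:crypto';

// Hypothetical helper: hex-encoded HMAC-SHA256 over the raw body
function signPayload(rawBody, secret) {
  return crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
}

// Constant-time comparison to avoid leaking signature prefixes
function verifySignature(rawBody, signature, secret) {
  const expected = signPayload(rawBody, secret);
  return signature.length === expected.length &&
    crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(expected));
}

const body = JSON.stringify({ job_id: 'job_123', status: 'completed' });
const header = signPayload(body, 'whsec_test_secret');
// POST `body` to your endpoint with `X-Scrapix-Signature: <header>`
```

This lets you test the webhook path end to end without waiting for a real dispatch.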
GET /v1/status
Verify your asynchronous extraction pipelines or check global proxy pool health. Poll job IDs returned by the Webhook or Batch endpoints to track progress and retrieve results.
```bash
curl -X GET "https://api.scrapixdata.io/v1/status?job_id=job_xxxxxxx" \
  -H "Authorization: Bearer sk_live_xxxxxxxxxxx"
```
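A simple polling loop over this endpoint can be sketched as follows; `fetchStatus` stands in for an authenticated GET against `/v1/status`, and the `completed`/`failed` status values are assumptions about the response shape:

```javascript
// Repeatedly check a job until it finishes or the attempt budget runs out.
// `fetchStatus` is injected so the loop can be tested against a stub.
async function pollJob(jobId, fetchStatus, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchStatus(jobId);
    if (job.status === 'completed') return job; // assumed terminal states
    if (job.status === 'failed') throw new Error(`Job ${jobId} failed`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not finish within ${maxAttempts} polls`);
}
```

In production, `fetchStatus` would wrap a `fetch` call to `/v1/status?job_id=...` carrying your Bearer token.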
Cloudflare Bypass (ASP)
Our Advanced Stealth Protection (ASP) framework uses low-level TCP/IP fingerprint spoofing. When `"advanced_stealth_protection": true` is set, our proxy engine synthesizes JA3/HTTP2 profiles that statistically match real enterprise endpoints, bypassing protections such as Akamai, DataDome, and Cloudflare Turnstile.
```jsonc
// ASP configuration added to your request payload
{
  "url": "https://protected-site.com",
  "advanced_stealth_protection": true, // Activates fingerprint spoofing
  "solve_captchas": ["datadome", "turnstile"] // Hands token generation off to our backend
}
```