Introduction to ScrapixData

Welcome to the ScrapixData developer hub. Our unified API lets you extract data, bypass anti-bot protection, and render JavaScript with a single HTTP request, with no proxy infrastructure of your own to maintain.

Authentication

All API requests require an API key passed in the `Authorization: Bearer` header. Your key determines which geographic proxy pools you are authorized to use and sets your concurrency limits.

Authorization: Bearer sk_live_YOUR_API_KEY_HERE
Content-Type: application/json
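
As a sketch, the two headers above can be centralized in a small helper so every request carries the same credentials. The helper name and the `sk_` prefix check are illustrative, not part of any official SDK:

```javascript
// Build the standard ScrapixData request headers from an API key.
// scrapixHeaders is an illustrative helper; the sk_ prefix check is an assumption.
function scrapixHeaders(apiKey) {
  if (typeof apiKey !== 'string' || !apiKey.startsWith('sk_')) {
    throw new Error('Expected a ScrapixData secret key (sk_...)');
  }
  return {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json',
  };
}
```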

POST /v1/scrape

The primary endpoint for initiating a scraping job. It runs synchronously by default for lightweight pages, but automatically upgrades to a cloud browser instance when JS rendering or ASP is requested.

Parameter        Type                 Description
url              string (required)    The target URL to scrape. Must include http:// or https://.
render_js        boolean              If true, renders the DOM in a headless Chrome instance.
asp              boolean              Anti-scraping protection. Attempts to bypass Cloudflare or DataDome automatically.
extract_prompt   string               A natural-language instruction; the page content is returned as JSON.
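
A minimal sketch of assembling and validating a /v1/scrape request body from the parameters above. Only the parameter names and the URL rule come from the table; the builder function itself is illustrative:

```javascript
// Assemble a /v1/scrape request body, enforcing the documented url rule.
// buildScrapePayload is an illustrative helper, not part of any SDK.
function buildScrapePayload({ url, render_js, asp, extract_prompt }) {
  if (typeof url !== 'string' || !/^https?:\/\//.test(url)) {
    throw new Error('url is required and must include http:// or https://');
  }
  const payload = { url };
  if (render_js !== undefined) payload.render_js = Boolean(render_js);
  if (asp !== undefined) payload.asp = Boolean(asp);
  if (extract_prompt !== undefined) payload.extract_prompt = String(extract_prompt);
  return payload;
}
```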

Example: Extraction via cURL

Here is a complete request that renders JavaScript, routes through a pinned UK residential proxy session, waits for the page's data grid, and extracts structured records with CSS rules.

curl -X POST "https://api.scrapixdata.io/v1/scrape" \
     -H "Authorization: Bearer sk_live_xxxxxxxxxxx" \
     -H "Content-Type: application/json" \
     -d '{
           "url": "https://www.target-enterprise.com/data",
           "render_js": true,
           "advanced_stealth_protection": true,
           "proxy": {
               "type": "residential",
               "country": "GB",
               "city": "London",
               "session_id": "uk_crawler_001"
           },
           "actions": [
               { "action": "wait_for_selector", "selector": ".data-grid" },
               { "action": "scroll_bottom" },
               { "action": "wait", "milliseconds": 2000 }
           ],
           "extract_rules": {
               "records": {
                   "selector": ".data-row",
                   "output": {
                       "id": "@data-id",
                       "value": ".price | number",
                       "status": ".status-badge | text"
                   }
               }
           }
         }'

Example: Node.js SDK

import { ScrapixClient, ExtractionEngine } from 'scrapix-node';

const client = new ScrapixClient({
  apiKey: 'sk_live_xxxxxxxxxxx',
  maxRetries: 3,
  timeout: 45000 // 45s for heavy DOM renders
});

// Configure a stealth scrape with LLM-based extraction and webhook delivery
const job = await client.scrape({
  url: 'https://financial-data-platform.com/metrics',
  method: 'GET',
  stealth: {
    engine: 'chromium-120-patched',
    headers: { 'X-Requested-With': 'XMLHttpRequest' },
    solve_captchas: true
  },
  proxy: {
    pool: 'residential',
    targeting: { asn: 7922, country: 'US' },
    session_id: 'fin_runner_22'
  },
  extract: {
    schema: ExtractionEngine.VisualLLM,
    model: 'scrapix-70b-vision',
    prompt: 'Extract the entire P&L statement as a strictly typed JSON array of objects'
  },
  webhook: {
    url: 'https://api.your-company.com/webhooks/scrapix',
    secret: 'whsec_XXXXXX'
  }
});

console.log(`Job queued via Webhook Pipeline. ID: ${job.id}`);

POST /v1/batch

Submit up to 10,000 URLs in a single payload. The batch engine distributes the load across a pool of 50M+ rotating residential IPs, automatically mitigating ASN bans and managing concurrency windows.

import { ScrapixBatchQueue } from 'scrapix-node';

const batch = new ScrapixBatchQueue('sk_live_xxxxxxxxxxx');
await batch.push([
    { url: 'https://competitor.com/category-A', proxy: 'US' },
    { url: 'https://competitor.com/category-B', proxy: 'GB' }
], {
    concurrency_limit: 500,
    retry_policy: 'exponential_backoff',
    stealth_profile: 'aggressive_human_emulation'
});
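
Since a single payload is capped at 10,000 URLs, larger crawls need to be split before pushing. A sketch of a chunking helper (illustrative, not part of the SDK):

```javascript
// Split a URL list into payloads that respect the 10,000-URL batch cap.
function chunkUrls(entries, maxPerBatch = 10000) {
  const batches = [];
  for (let i = 0; i < entries.length; i += maxPerBatch) {
    batches.push(entries.slice(i, i + maxPerBatch));
  }
  return batches;
}
```

Each resulting batch can then be passed to `batch.push` in turn.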

POST /v1/extract/llm

Use the `scrapix-70b-vision` unstructured DOM parser to avoid writing fragile CSS selectors. You pass a URL and a JSON Schema, and the Vision Engine translates the raw page into strictly typed data.

{
  "url": "https://complex-spas.com/dynamic-data",
  "render_js": true,
  "schema": {
    "product_name": "String",
    "historical_prices": "Array",
    "is_in_stock": "Boolean"
  },
  "instructions": "Locate the primary item. If stock says 'Backordered', map to false."
}
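
Responses can be sanity-checked client-side against the same simple type-name schema. A sketch, where the checker is illustrative and only the String/Array/Boolean type names come from the example above:

```javascript
// Verify an extraction result against a { field: typeName } schema.
const TYPE_CHECKS = {
  String: v => typeof v === 'string',
  Number: v => typeof v === 'number',
  Boolean: v => typeof v === 'boolean',
  Array: v => Array.isArray(v),
};

function matchesSchema(result, schema) {
  return Object.entries(schema).every(
    ([field, typeName]) => (TYPE_CHECKS[typeName] || (() => false))(result[field])
  );
}
```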

Webhooks Setup & Security

For operations exceeding 10 seconds (such as JS rendering or automated scrolling), synchronous HTTP requests will time out. Subscribe to webhook dispatches to receive data asynchronously as soon as jobs complete.

const crypto = require('crypto');

app.post('/scrapix/webhook', express.raw({ type: 'application/json' }), (req, res) => {
  const signature = req.headers['x-scrapix-signature'] || '';
  // Recompute the expected signature; HMAC-SHA256 over the raw body is assumed here.
  const expected = crypto.createHmac('sha256', process.env.WH_SECRET)
    .update(req.body)
    .digest('hex');
  const valid = signature.length === expected.length &&
    crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(expected));

  if (valid) {
    const { job_id, scraped_json } = JSON.parse(req.body);
    database.saveExtraction(job_id, scraped_json);
    res.sendStatus(200);
  } else {
    res.status(401).send('Invalid signature');
  }
});

GET /v1/status

Check on your asynchronous extraction pipelines or the global proxy pool health. Poll job IDs returned by the webhook or batch endpoints to retrieve results.

curl -X GET "https://api.scrapixdata.io/v1/status?job_id=job_xxxxxxx" \
     -H "Authorization: Bearer sk_live_xxxxxxxxxxx"
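
A sketch of a polling loop around GET /v1/status. The fetcher is injected so the loop stays transport-agnostic; `pollUntilDone` and the `done`/`failed` status values are illustrative assumptions:

```javascript
// Poll a job until it reaches a terminal state, with a fixed interval.
// fetchStatus is any function returning a Promise of { status, ... }.
async function pollUntilDone(fetchStatus, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchStatus();
    if (job.status === 'done' || job.status === 'failed') return job;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error('Gave up polling: job did not finish in time');
}
```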

Cloudflare Bypass (ASP)

Our Advanced Stealth Protection (ASP) framework spoofs low-level TCP/IP and TLS fingerprints. With `"advanced_stealth_protection": true`, the proxy engine synthesizes JA3 and HTTP/2 profiles that match real browser traffic, allowing requests to pass checks by edge CDNs and bot-detection vendors (Akamai, DataDome, Cloudflare Turnstile).

// ASP options added to your request payload
{
    "url": "https://protected-site.com",
    "advanced_stealth_protection": true, // Activates Fingerprint Spoofing
    "solve_captchas": ["datadome", "turnstile"] // Hands off token generation to our backend AI logic
}