Scrape a URL (ephemeral browser)

curl --request POST \ --url https://api.webcompute.dev/v1/scrape \ --header 'Authorization: Bearer <token>' \ --header 'Content-Type: application/json' \ --data ' { "url": "<string>", "proxy": "<string>", "waitFor": "<string>", "timeout": 30000, "policy": {}, "format": "markdown", "selector": "<string>", "iframe": "<string>", "solveCaptcha": false, "mode": "structured", "extract": "tree", "maxChars": 20000, "startFromChar": 5000000, "readFrom": "start", "include": [], "includeLinks": false, "iframeText": false } '

{ "v": 1, "url": "https://example.com/articles", "title": "Example Domain", "format": "markdown", "content": "# Example Domain\n\nThis domain is for use in illustrative examples.", "provenance": {}, "diagnostics": {}, "completeness": { "ratio": 0.97, "complete": true, "capLimited": false, "pruningLimited": false }, "statusCode": 200, "captchaSolved": false, "captchaDetected": "cloudflare-turnstile", "captchaState": "solved", "elapsedMs": 842, "tree": [ { "depth": 0, "text": "Section heading", "children": "<array>" } ], "description": "Short summary surfaced under mode=summary.", "links": [ { "text": "Pricing", "href": "https://example.com/pricing" } ], "forms": [ { "valuePresent": false, "name": "email", "label": "Email address", "placeholder": "you@example.com", "type": "email" } ], "headings": [ { "level": 2, "text": "Features" } ], "captcha": [ { "type": "cloudflare-turnstile", "category": "blocking", "state": "solved", "sitekey": "0x4AAAAAAABbbbcccc", "pageUrl": "https://example.com/login", "frameId": "7A2A1B9DDEF4B8B9A6E99D8A7A65DCEE", "detectedAt": 1713890000123, "resolvedAt": 1713890001456, "method": "auto-wait", "elapsedMs": 333 } ], "warnings": [ "scrape-truncated:recommendedMaxChars=30000" ], "truncated": true, "charsOmitted": 1530, "nodesOmitted": 42, "pruningRatio": 0.38 }

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

url

string

required

Target URL to navigate to

proxy

string

Proxy URL (protocol://user:pass@host:port)

waitFor

string

CSS selector to wait for before action

timeout

number

default:30000

Action timeout in milliseconds (1-60000)

policy

object

Browser navigation policy for this action

format

enum<string>

default:markdown

Content output format

Available options:

html,

markdown,

text

selector

string

CSS selector scoping deterministic content reads. Omit to let the scraper pick a semantic landmark or fall back to body with landmark pruning.

iframe

string

CSS selector of a specific iframe to scrape.

solveCaptcha

boolean

default:false

Request passive Phase 1 CAPTCHA auto-resolution before extracting content

mode

enum<string>

default:structured

Output shape. 'summary' emits short content (<=1200 chars) plus description; 'structured' emits the visible content payload with optional tree[] when extract='tree'; 'raw' opts out of the maxChars clamp.

Available options:

summary,

structured,

raw

extract

enum<string> | null

Opt-in content extractor. 'tree' emits hierarchical visible content (comment threads, file trees, nested menus). Use runtime extract for repeated records.

Available options:

tree

maxChars

integer

default:20000

Soft cap on response content character count. Clamps above default trigger truncated=true + charsOmitted in the response.

Required range: 1 <= x <= 50000

startFromChar

integer

Zero-based content character offset for continuing a truncated read.

Required range: 0 <= x <= 10000000

readFrom

enum<string>

default:start

Read window direction. Use end for latest/bottom content; startFromChar is ignored in end-window mode.

Available options:

start,

end

include

enum<string>[]

Opt-in extras added to the lean default scrape response. When supplied, this exact list controls optional categories; include: [] explicitly opts out of all extras. Security signals (captchaDetected/captchaState/captchaSolved), structural metadata, and diagnostics (truncated/etc.) are always populated regardless. Duplicates are collapsed.

Available options:

links,

forms,

headings,

captcha

includeLinks

boolean

default:false

Include deterministic anchor links in read output.

iframeText

boolean

default:false

When true and format is not 'html', append iframe text to content.

Response

Scrape result with deterministic content. Response always includes completeness.

enum<number>

required

Available options:

1

Example:

1

url

string

required

Example:

"https://example.com/articles"

title

string

required

Example:

"Example Domain"

format

enum<string>

required

Available options:

html,

markdown,

text

Example:

"markdown"

content

string

required

Example:

"# Example Domain\n\nThis domain is for use in illustrative examples."

provenance

object

required

diagnostics

object

required

completeness

object

required

Show child attributes

statusCode

number

required

Example:

200

captchaSolved

boolean

required

Example:

false

captchaDetected

enum<string> | null

required

Available options:

recaptcha-v2,

recaptcha-v3,

hcaptcha,

cloudflare-turnstile,

cloudflare-challenge,

datadome-captcha,

aws-waf-captcha,

geetest,

waf,

unknown

Example:

"cloudflare-turnstile"

captchaState

enum<string> | null

required

Available options:

detected,

solving,

solved,

failed,

timeout,

cancelled,

expired,

interactive_required

Example:

"solved"

elapsedMs

number

required

Example:

842

tree

object[]

Show child attributes

description

string

Example:

"Short summary surfaced under mode=summary."

links

object[]

Show child attributes

forms

object[]

Show child attributes

headings

object[]

Show child attributes

captcha

object[]

Show child attributes

warnings

string[]

Example:

[
  "scrape-truncated:recommendedMaxChars=30000"
]

truncated

boolean

Example:

true

charsOmitted

number

Example:

1530

nodesOmitted

number

Example:

42

pruningRatio

number

Example:

0.38