Scrape
Fetch and extract structured content from web pages for use in downstream processing.
Scrape nodes fetch a web page and extract its content in one or more formats. Use them to pull in articles, documentation, product pages, or other web content for processing by AI, Code, or other downstream nodes.
Configuration
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | No | The URL to fetch. Supports placeholders (for example, {{input.articleUrl}}). |
| onlyMainContent | boolean | No | When true, extract only the primary content (for example, the article body) and strip navigation, sidebars, ads, and footers. Defaults to false. |
| formats | string[] | No | Output formats to return (for example, ["markdown", "html"]). Use this to control how the extracted content is structured. |
| maxAge | number | No | Maximum age of a cached response, in seconds. When set, the scraper returns a cached result if one exists within this time window, reducing redundant requests. |
| parsers | string[] | No | Parser names to apply during content extraction (for example, ["readability"]). Available parsers depend on the configured scraping provider. |
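Putting the fields above together, a minimal configuration might look like the following. The field names match the table, but the surrounding JSON shape is an illustrative assumption, not a documented contract:

```json
{
  "url": "{{input.articleUrl}}",
  "onlyMainContent": true,
  "formats": ["markdown"],
  "maxAge": 3600,
  "parsers": ["readability"]
}
```

Here `maxAge: 3600` allows a cached result up to one hour old to be reused instead of re-fetching the page.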
Input
The node receives the trigger payload and upstream node outputs. Use placeholders in the `url` field (and other string fields) to scrape dynamic URLs at runtime, for example `{{input.link}}`.
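As a sketch of how placeholder resolution works (the payload shape and field name here are hypothetical), suppose the trigger delivers:

```json
{
  "input": {
    "link": "https://example.com/articles/123"
  }
}
```

A `url` value of `{{input.link}}` would then resolve to `https://example.com/articles/123` before the fetch is made.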
Output
The node produces the fetched and extracted content in the requested formats (for example, raw HTML, Markdown, or structured data). This output is stored in the node execution record and passed to downstream nodes.
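The exact output structure depends on the requested formats. As an illustrative sketch only (the keys shown are assumptions based on the formats requested, not a documented schema), a scrape configured with `"formats": ["markdown", "html"]` might produce something like:

```json
{
  "markdown": "# Article Title\n\nArticle body text...",
  "html": "<h1>Article Title</h1><p>Article body text...</p>"
}
```

Downstream nodes can then reference the format they need, such as the Markdown output for an LLM prompt.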
Tips
- Enable `onlyMainContent` when you need clean article text for summarization or analysis. This strips boilerplate and keeps token counts manageable for downstream LLM nodes.
- Set `maxAge` to avoid redundant fetches of the same URL and to stay within rate limits.
- Always respect the target site's `robots.txt` and terms of service. Use scraping only for content you are authorized to access.
- If the target page requires authentication or presents anti-bot measures, consider using an HTTP node with custom headers instead.