
Scrape

Fetch and extract structured content from web pages for use in downstream processing.

Scrape nodes fetch a web page and extract its content in one or more formats. Use them to pull in articles, documentation, product pages, or other web content for processing by AI, Code, or other downstream nodes.

Configuration

  • url (string, optional) -- The URL to fetch. Supports placeholders (for example, {{input.articleUrl}}).
  • onlyMainContent (boolean, optional) -- When true, extracts only the primary content (for example, the article body) and strips navigation, sidebars, ads, and footers. Defaults to false.
  • formats (string[], optional) -- Output formats to return (for example, ["markdown", "html"]). Use this to control how the extracted content is structured.
  • maxAge (number, optional) -- Maximum age of a cached response, in seconds. When set, the scraper returns a cached result if one exists within this time window, reducing redundant requests.
  • parsers (string[], optional) -- Parser names to apply during content extraction (for example, ["readability"]). Available parsers depend on the configured scraping provider.
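As a concrete illustration, a Scrape node configured to pull clean Markdown from an article URL supplied by the trigger might use a config like the following (field names are those in the table above; the exact wrapper structure around this object depends on how your workflow definition is stored):

```json
{
  "url": "{{input.articleUrl}}",
  "onlyMainContent": true,
  "formats": ["markdown"],
  "maxAge": 3600,
  "parsers": ["readability"]
}
```

Here maxAge of 3600 means a result cached within the last hour is reused instead of re-fetching the page.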

Input

The node receives the trigger payload and upstream node outputs. Use placeholders in the url field (and other string fields) to scrape dynamic URLs at runtime -- for example, {{input.link}}.

Output

The node produces the fetched and extracted content in the requested formats (for example, raw HTML, Markdown, or structured data). This output is stored in the node execution record and passed to downstream nodes.
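For orientation, the output might look roughly like the sketch below, assuming the node keys its output by requested format. This is a hypothetical shape only; the exact field names and schema depend on the configured scraping provider:

```json
{
  "markdown": "# Example Article\n\nBody text of the article...",
  "html": "<article><h1>Example Article</h1>...</article>"
}
```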

Tips

  • Enable onlyMainContent when you need clean article text for summarization or analysis. This strips boilerplate and keeps token counts manageable for downstream LLM nodes.
  • Set maxAge to avoid redundant fetches of the same URL and to stay within rate limits.
  • Always respect the target site's robots.txt and terms of service. Use scraping only for content you are authorized to access.
  • If the target page requires authentication or presents anti-bot measures, consider using an HTTP node with custom headers instead.
Related nodes

  • HTTP -- For direct API calls or when you need full control over request headers and authentication.
  • AI -- Summarize, analyze, or extract information from scraped content.
  • Code -- Parse or transform scraped content before passing it downstream.
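The maxAge caching behavior recommended in the tips above can be modeled as follows. This is an illustrative sketch, not the node's actual implementation; the in-memory cache and the fetch callable are hypothetical stand-ins for the scraping provider:

```python
import time

# Hypothetical in-memory cache: url -> (fetched_at, content)
_cache = {}

def scrape(url, fetch, max_age=None):
    """Return cached content if it was fetched within the last
    max_age seconds; otherwise fetch the page and cache the result."""
    now = time.time()
    if max_age is not None and url in _cache:
        fetched_at, content = _cache[url]
        if now - fetched_at <= max_age:
            return content  # fresh enough: skip the network request
    content = fetch(url)       # e.g. an HTTP GET via the provider
    _cache[url] = (now, content)
    return content
```

With this model, two scrapes of the same URL inside the max_age window trigger only one fetch, which is what keeps you under rate limits.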
