Web Scraper
Description
The webscraper module extracts structured information from HTML content using CSS selectors. It receives HTML as input data (typically from the HTTP module) and extracts elements according to the `elements` configuration. It supports metadata extraction, CSS selectors for any HTML element, nested sub-elements, and `data-*` attributes. Typical uses include web page scraping, product price extraction, link collection, and HTML content analysis.
Configuration
| Parameter | Type | Required | Description |
|---|---|---|---|
| elements | json/array | No | Configuration of elements to extract. If empty, metadata, h1, p, img and links are extracted by default |
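
As a rough illustration, the `elements` parameter could be typed as shown below. The field names (`type`, `selector`, `name`, `subElements`, `includeDataAttributes`) come from the examples on this page; the type names, the `resolveElements` helper, and the `name` values used for the defaults are hypothetical, since the documentation only states which elements are extracted by default, not how the results are keyed.

```typescript
// Hypothetical type names; only the field names come from this page.
interface SubElement {
  selector: string; // CSS selector evaluated inside the parent element
  name: string;     // key under which the extracted value is returned
}

interface MetaElement {
  type: "meta";
  name: string;
}

interface SelectorElement {
  type: "selector";
  selector: string;
  name: string;
  subElements?: SubElement[];
  includeDataAttributes?: boolean;
}

type ElementConfig = MetaElement | SelectorElement;

// Assumed defaults: the docs list metadata, h1, p, img and links,
// but the `name` values below are placeholders, not documented keys.
const DEFAULT_ELEMENTS: ElementConfig[] = [
  { type: "meta", name: "metadata" },
  { type: "selector", selector: "h1", name: "h1" },
  { type: "selector", selector: "p", name: "p" },
  { type: "selector", selector: "img", name: "img" },
  { type: "selector", selector: "a", name: "links" },
];

// Fall back to the defaults when `elements` is missing or empty.
function resolveElements(elements?: ElementConfig[]): ElementConfig[] {
  return elements && elements.length > 0 ? elements : DEFAULT_ELEMENTS;
}
```

Modeling the two element kinds as a discriminated union on `type` lets the compiler reject a `meta` entry that carries a `selector`, which is one plausible way to catch configuration mistakes early.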
Elements format
    [
      { "type": "meta", "name": "metadata" },
      { "type": "selector", "selector": "h1", "name": "titulos" },
      {
        "type": "selector",
        "selector": ".producto",
        "name": "productos",
        "subElements": [
          { "selector": "h2", "name": "nombre" },
          { "selector": ".precio", "name": "precio" },
          { "selector": "img", "name": "imagen" }
        ],
        "includeDataAttributes": true
      }
    ]

Output
    {
      "nextModule": "siguiente_modulo",
      "data": {
        "extractedData": {
          "metadata": {
            "description": "Descripcion del sitio",
            "og:title": "Titulo Open Graph"
          },
          "titulos": [
            { "text": "Titulo Principal" }
          ],
          "productos": [
            {
              "nombre": "Producto A",
              "precio": "29.99",
              "imagen": "/img/producto-a.jpg",
              "dataAttributes": { "data-id": "123" }
            }
          ]
        }
      }
    }

Usage Example
Basic case
    {
      "elements": [
        { "type": "meta", "name": "metadata" },
        { "type": "selector", "selector": "h1", "name": "titles" },
        { "type": "selector", "selector": "a", "name": "links" }
      ]
    }

- The input data must be HTML content (string); typically obtained from the HTTP module with a prior GET
- Uses the cheerio library for HTML parsing
- Supported element types: `meta` (extracts meta tags) and `selector` (CSS selector)
- For `img` elements, the `src` attribute is extracted; for `a`, `href` is extracted; for others, the text
- Sub-elements (`subElements`) allow extracting nested data within a container
- `includeDataAttributes: true` extracts all `data-*` attributes from the element
- If `elements` is empty or not configured, defaults are extracted: metadata, h1, p, img and links
- If a single sub-result is found, it is returned as a direct value instead of an array
- Does not require credentials
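
The value-selection rules in the notes above can be sketched as follows. This is a simplified model, not the module's actual code: `ParsedElement` is a hypothetical stand-in for a parsed node (it is not cheerio's API), and the handling of a missing attribute or of zero matches is an assumption, since the documentation does not cover those cases.

```typescript
// Hypothetical stand-in for a parsed HTML element (not cheerio's API).
interface ParsedElement {
  tag: string;
  text: string;
  attrs: Record<string, string>;
}

// img -> src, a -> href, anything else -> text content.
// Assumption: a missing attribute yields an empty string.
function extractValue(el: ParsedElement): string {
  const tag = el.tag.toLowerCase();
  if (tag === "img") return el.attrs["src"] ?? "";
  if (tag === "a") return el.attrs["href"] ?? "";
  return el.text;
}

// Collect every data-* attribute, mirroring `includeDataAttributes: true`.
function dataAttributes(el: ParsedElement): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of Object.keys(el.attrs)) {
    if (key.indexOf("data-") === 0) out[key] = el.attrs[key];
  }
  return out;
}

// A single sub-result is returned directly; otherwise the array is kept.
// Assumption: zero matches return an empty array.
function collapse<T>(values: T[]): T | T[] {
  return values.length === 1 ? values[0] : values;
}
```

Because a sub-element may come back as either a direct value or an array, downstream modules that consume `extractedData` may want to check with `Array.isArray` before iterating.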
Related Nodes
- http (get the HTML from a web page)
- checkSite (check a site’s status)