Web Scraper

Description

The webscraper module extracts structured information from HTML content using CSS selectors. It receives HTML as input data (typically from the HTTP module) and extracts elements according to the elements configuration. It supports meta tag extraction, CSS selectors for any HTML element, nested sub-elements, and data-* attributes. Typical uses include web page scraping, product price extraction, link collection, and HTML content analysis.

Configuration

Parameter	Type	Required	Description
elements	json/array	No	Configuration of elements to extract. If empty, metadata, h1, p, img and links are extracted by default

Elements format

[
  {
    "type": "meta",
    "name": "metadata"
  },
  {
    "type": "selector",
    "selector": "h1",
    "name": "titulos"
  },
  {
    "type": "selector",
    "selector": ".producto",
    "name": "productos",
    "subElements": [
      { "selector": "h2", "name": "nombre" },
      { "selector": ".precio", "name": "precio" },
      { "selector": "img", "name": "imagen" }
    ],
    "includeDataAttributes": true
  }
]
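
A configuration in this format can be checked before it is handed to the module. The following is a minimal sketch, not the module's own validation: validateElements is a hypothetical helper, and the rule that every entry needs a name and every selector entry needs a selector string is an assumption inferred from the example above.

```javascript
// Hypothetical helper: checks an `elements` array against the shape shown above.
// Assumption: `name` is needed on every entry and `selector` on "selector" entries.
function validateElements(elements) {
  if (!Array.isArray(elements)) return ["elements must be an array"];
  const errors = [];
  elements.forEach((el, i) => {
    if (el === null || typeof el !== "object") {
      errors.push(`elements[${i}]: must be an object`);
      return;
    }
    if (el.type !== "meta" && el.type !== "selector") {
      errors.push(`elements[${i}]: unknown type "${el.type}"`);
    }
    if (typeof el.name !== "string" || el.name === "") {
      errors.push(`elements[${i}]: missing name`);
    }
    if (el.type === "selector" && typeof el.selector !== "string") {
      errors.push(`elements[${i}]: missing selector`);
    }
    // Sub-elements in the example carry only a selector and a name.
    (el.subElements || []).forEach((sub, j) => {
      if (typeof sub.selector !== "string" || typeof sub.name !== "string") {
        errors.push(`elements[${i}].subElements[${j}]: needs selector and name`);
      }
    });
  });
  return errors;
}
```

Called with the entries from the example, the helper returns an empty array; removing "selector" from the third entry yields one error message.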

Output

{
  "nextModule": "siguiente_modulo",
  "data": {
    "extractedData": {
      "metadata": {
        "description": "Site description",
        "og:title": "Open Graph title"
      },
      "titulos": [
        { "text": "Main Title" }
      ],
      "productos": [
        {
          "nombre": "Product A",
          "precio": "29.99",
          "imagen": "/img/producto-a.jpg",
          "dataAttributes": { "data-id": "123" }
        }
      ]
    }
  }
}
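
A downstream step usually needs to walk this structure and convert the extracted strings. The sketch below shows one way to do that in plain JavaScript. readProducts is a hypothetical helper, and the keys productos, nombre, precio and imagen only exist because the configuration above names them that way.

```javascript
// Sample output shaped like the example above; values are illustrative.
const output = {
  nextModule: "siguiente_modulo",
  data: {
    extractedData: {
      productos: [
        {
          nombre: "Product A",
          precio: "29.99",
          imagen: "/img/producto-a.jpg",
          dataAttributes: { "data-id": "123" },
        },
      ],
    },
  },
};

// Hypothetical helper: turns the extracted products into typed records.
function readProducts(moduleOutput) {
  const products = moduleOutput?.data?.extractedData?.productos ?? [];
  return products.map((p) => ({
    id: p.dataAttributes?.["data-id"] ?? null,
    name: p.nombre,
    price: Number.parseFloat(p.precio), // extracted values arrive as strings
    image: p.imagen,
  }));
}

console.log(readProducts(output));
```

The optional chaining keeps the helper from throwing when a page yields no products, in which case it returns an empty array.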

Usage Example

Basic case

{
  "elements": [
    { "type": "meta", "name": "metadata" },
    { "type": "selector", "selector": "h1", "name": "titles" },
    { "type": "selector", "selector": "a", "name": "links" }
  ]
}

Notes

The input data must be HTML content (a string), typically obtained from the HTTP module with a prior GET request.
Uses the cheerio library for HTML parsing.
Supported element types: meta (extracts meta tags) and selector (CSS selector).
For img elements, the src attribute is extracted; for a elements, the href attribute; for all other elements, the text content.
Sub-elements (subElements) extract nested data within a container element.
includeDataAttributes: true extracts all data-* attributes from the element.
If elements is empty or not configured, the defaults are extracted: metadata, h1, p, img and links.
If only a single sub-result is found, it is returned as a direct value instead of an array.
Does not require credentials.
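
Because a single sub-result comes back as a direct value, code that consumes sub-element fields may want to normalize them before iterating. The following is a small sketch of that idea. toArray is a hypothetical helper, and the handling of missing values (treated here as an empty list) is an assumption, since the notes only describe the single-result case.

```javascript
// Hypothetical helper: wraps a direct value in an array and leaves arrays untouched.
// Assumption: a missing field (undefined or null) is treated as "no results".
function toArray(value) {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// A container with one matching image versus one with several (illustrative data).
const single = { imagen: "/img/a.jpg" };
const many = { imagen: ["/img/a.jpg", "/img/b.jpg"] };

for (const src of toArray(single.imagen)) console.log(src);
for (const src of toArray(many.imagen)) console.log(src);
```

Normalizing at the point of use keeps the rest of the workflow free of type checks on every field.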

Related modules

http (gets the HTML from a web page)
checkSite (checks a site’s status)