Web Scraper

The webscraper module extracts structured information from HTML content using CSS selectors. It receives HTML as input data (typically from the HTTP module) and extracts elements according to the selector configuration. It supports metadata extraction, CSS selectors for any HTML element, nested sub-elements, and data-* attributes. Typical uses are web page scraping, product price extraction, link collection, and HTML content analysis.

Parameter | Type | Required | Description
elements | json/array | No | Configuration of the elements to extract. If empty, metadata, h1, p, img and links are extracted by default
Example value for elements:

[
  {
    "type": "meta",
    "name": "metadata"
  },
  {
    "type": "selector",
    "selector": "h1",
    "name": "titulos"
  },
  {
    "type": "selector",
    "selector": ".producto",
    "name": "productos",
    "subElements": [
      { "selector": "h2", "name": "nombre" },
      { "selector": ".precio", "name": "precio" },
      { "selector": "img", "name": "imagen" }
    ],
    "includeDataAttributes": true
  }
]
Output example:

{
  "nextModule": "siguiente_modulo",
  "data": {
    "extractedData": {
      "metadata": {
        "description": "Site description",
        "og:title": "Open Graph title"
      },
      "titulos": [
        { "text": "Main Title" }
      ],
      "productos": [
        {
          "nombre": "Product A",
          "precio": "29.99",
          "imagen": "/img/producto-a.jpg",
          "dataAttributes": { "data-id": "123" }
        }
      ]
    }
  }
}
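A single sub-result comes back as a direct value while several come back as an array, so code that consumes this output may want to normalize both shapes. The following is a minimal Python sketch, assuming the module output has been serialized to JSON with the same shape as the output example; the helper name `as_list` and the trimmed sample payload are illustrative and not part of the module.

```python
import json

# Sample payload with the same shape as the output example (trimmed).
raw = """
{
  "nextModule": "siguiente_modulo",
  "data": {
    "extractedData": {
      "productos": [
        {
          "nombre": "Product A",
          "precio": "29.99",
          "imagen": "/img/producto-a.jpg",
          "dataAttributes": { "data-id": "123" }
        }
      ]
    }
  }
}
"""


def as_list(value):
    """Hypothetical helper: wrap a direct value in a list, keep lists as they are."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


extracted = json.loads(raw)["data"]["extractedData"]

products = []
for product in as_list(extracted.get("productos")):
    products.append({
        "id": product.get("dataAttributes", {}).get("data-id"),
        # "nombre" may be a string (one match) or a list (several matches)
        "names": as_list(product.get("nombre")),
        # prices arrive as text, so convert explicitly
        "price": float(product["precio"]),
    })

print(products)
```

Wrapping every field with `as_list` keeps the consumer independent of how many nodes each selector happened to match on a given page.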
Configuration example:

{
  "elements": [
    { "type": "meta", "name": "metadata" },
    { "type": "selector", "selector": "h1", "name": "titles" },
    { "type": "selector", "selector": "a", "name": "links" }
  ]
}
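Before sending a configuration to the module, it can help to check it against the rules this page describes. The sketch below is a hypothetical pre-flight check in Python, not something the module provides. It assumes that every entry needs a `name` and that entries of type `selector` need a `selector`; the page implies both but does not state them as requirements.

```python
SUPPORTED_TYPES = {"meta", "selector"}  # the two element types listed on this page


def check_elements(elements):
    """Return a list of problems found in an `elements` configuration.

    Assumption: `name` is needed on every entry, and `selector` on every
    entry of type "selector" and on every sub-element.
    """
    problems = []
    for index, element in enumerate(elements):
        where = f"elements[{index}]"
        kind = element.get("type")
        if kind not in SUPPORTED_TYPES:
            problems.append(f"{where}: unsupported type {kind!r}")
        if not element.get("name"):
            problems.append(f"{where}: missing name")
        if kind == "selector" and not element.get("selector"):
            problems.append(f"{where}: missing selector")
        for sub_index, sub in enumerate(element.get("subElements", [])):
            if not sub.get("selector") or not sub.get("name"):
                problems.append(
                    f"{where}.subElements[{sub_index}]: needs selector and name"
                )
    return problems


config = {
    "elements": [
        {"type": "meta", "name": "metadata"},
        {"type": "selector", "selector": "h1", "name": "titles"},
        {"type": "selector", "selector": "a", "name": "links"},
    ]
}

print(check_elements(config["elements"]))
print(check_elements([{"type": "xpath", "name": "x"}]))
```

The first call checks the configuration example and reports no problems; the second reports the unsupported type.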
Notes:
  • The input data must be an HTML string, typically obtained by a prior GET request with the HTTP module
  • Uses the cheerio library for HTML parsing
  • Supported element types: meta (extracts meta tags) and selector (CSS selector)
  • For img elements, the src attribute is extracted; for a elements, the href attribute; for all other elements, the text content
  • Sub-elements (subElements) allow extracting nested data within a container
  • includeDataAttributes: true extracts all data-* attributes from the element
  • If elements is empty or not configured, defaults are extracted: metadata, h1, p, img and links
  • If a single sub-result is found, it is returned as a direct value instead of an array
  • Does not require credentials
Related modules:
  • http (get the HTML from a web page)
  • checkSite (check a site’s status)
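The value-selection rules from the notes (src for img, href for a, text otherwise, collection of data-* attributes, and collapsing a single result to a direct value) can be sketched as plain Python functions. This is not the module's implementation, which uses cheerio. The function names and the `(tag, attributes, text)` tuples are hypothetical, and what is returned when nothing matches is an assumption.

```python
def element_value(tag, attrs, text):
    """Pick the value for one matched element, following the notes."""
    if tag == "img":
        return attrs.get("src")
    if tag == "a":
        return attrs.get("href")
    return text


def data_attributes(attrs):
    """Keep only data-* attributes, as includeDataAttributes: true would."""
    return {name: value for name, value in attrs.items() if name.startswith("data-")}


def collapse(values):
    """A single result becomes a direct value; several stay a list.

    Assumption: an empty list is returned unchanged when nothing matched.
    """
    return values[0] if len(values) == 1 else values


# Hypothetical matches, written as (tag, attributes, text) tuples.
matches = [
    ("h2", {}, "Product A"),
    ("img", {"src": "/img/producto-a.jpg", "data-id": "123"}, ""),
    ("a", {"href": "/products/a"}, "View"),
]

values = [element_value(*match) for match in matches]
print(values)
print(collapse(values[:1]))
print(data_attributes(matches[1][1]))
```

Keeping the three rules in separate functions mirrors how they are described: value selection depends only on the tag, while collapsing depends only on how many nodes matched.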