gtmlp

package module
Published: Feb 5, 2026 License: MIT Imports: 20 Imported by: 0

README

GTMLP

Type-safe HTML scraping with XPath selectors and external configuration.

Features

  • Type-safe with Go generics
  • External config (JSON/YAML)
  • XPath validation before scraping
  • Pagination support - Auto-follow next-link or numbered pagination
  • Fallback XPath chains (altXpath, altContainer) for handling varying HTML structures
  • Data transformation pipes (trim, int/float conversion, regex, URL parsing, etc.)
  • Custom pipe registration for domain-specific transformations
  • Structured logging with configurable levels (slog-based)
  • SSRF protection - Blocks private IPs by default
  • Health checks for URLs
  • Production-ready (retries, proxy, timeouts)

Installation

go get github.com/Hanivan/gtmlp

Quick Start

selectors.json:

{
  "container": "//div[@class='product']",
  "fields": {
    "name": {"xpath": ".//h2/text()"},
    "price": {"xpath": ".//span[@class='price']/text()", "pipes": ["trim", "tofloat"]}
  }
}

main.go:

type Product struct {
    Name  string  `json:"name"`
    Price float64 `json:"price"`
}

config, _ := gtmlp.LoadConfig("selectors.json", nil)
products, _ := gtmlp.ScrapeURL[Product](context.Background(), "https://example.com", config)

for _, p := range products {
    fmt.Printf("%s: %.2f\n", p.Name, p.Price)
}

Or embed config with go:embed:

//go:embed selectors.yaml
var configYAML string

config, _ := gtmlp.ParseConfig(configYAML, gtmlp.FormatYAML, nil)
products, _ := gtmlp.ScrapeURL[Product](context.Background(), "https://example.com", config)

Logging

Configure log levels for different environments:

import "log/slog"

// Development: see HTTP requests and scraping details
gtmlp.SetLogLevel(slog.LevelInfo)

// Troubleshooting: see XPath evaluation and fallbacks
gtmlp.SetLogLevel(slog.LevelDebug)

// Production: warnings and errors only (default)
gtmlp.SetLogLevel(slog.LevelWarn)

// Custom handler (JSON format, custom writer, etc.)
handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo})
gtmlp.SetLogger(slog.New(handler))

Log levels:

  • Debug - XPath evaluation, fallback usage, field extraction
  • Info - HTTP requests, scraping progress, pagination
  • Warn - Fallback usage, HTTP warnings, duplicate URLs (default)
  • Error - HTTP failures, parsing errors, validation failures

Security

GTMLP includes built-in SSRF (Server-Side Request Forgery) protection:

config := &gtmlp.Config{
    Container: "//div[@class='product']",
    Fields:    fields,

    // SSRF protection (default: enabled)
    // Blocks: localhost, 127.0.0.1, 10.x.x.x, 192.168.x.x, 169.254.169.254
    AllowPrivateIPs: false, // set true to allow private IPs

    // Custom URL validator
    URLValidator: func(url string) error {
        if !strings.Contains(url, "example.com") {
            return errors.New("domain not allowed")
        }
        return nil
    },
}

See SECURITY.md for security best practices.

Usage

// Load config from file
config, _ := gtmlp.LoadConfig("selectors.yaml", nil)

// Or embed with go:embed
//go:embed selectors.yaml
var configYAML string
config, _ := gtmlp.ParseConfig(configYAML, gtmlp.FormatYAML, nil)

// Scrape
products, _ := gtmlp.ScrapeURL[Product](context.Background(), url, config)
results, _ := gtmlp.ScrapeURLUntyped(context.Background(), url, config) // returns []map[string]any

Environment Variables

export GTMLP_TIMEOUT=30s
export GTMLP_USER_AGENT=MyBot/1.0
export GTMLP_PROXY=http://proxy:8080

Fallback XPath Chains

Handle varying HTML structures with altXpath and altContainer:

{
  "container": "//div[@class='product']",
  "altContainer": ["//article[@class='product']", "//div[@class='item']"],
  "fields": {
    "name": {
      "xpath": ".//h2/text()",
      "altXpath": [".//h3/text()", ".//h1/text()"]
    },
    "price": {
      "xpath": ".//span[@class='price']/text()",
      "altXpath": [".//div[@class='price']/text()"],
      "pipes": ["trim", "tofloat"]
    }
  }
}

How it works:

  • Tries primary XPath first
  • If empty (after pipes), tries each altXpath in order
  • Returns first non-empty result
  • Container fallback works the same way with altContainer
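The steps above can be sketched as follows. Here eval is a stand-in for real XPath evaluation against a container node; it is a placeholder, not part of the gtmlp API:

```go
package main

import (
	"fmt"
	"strings"
)

// firstNonEmpty mirrors the documented fallback behavior: try the primary
// XPath, then each alternative in order, returning the first result that
// is non-empty after trimming.
func firstNonEmpty(eval func(xpath string) string, primary string, alts []string) string {
	for _, xp := range append([]string{primary}, alts...) {
		if v := strings.TrimSpace(eval(xp)); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	// Simulated page where only the <h3> selector matches.
	eval := func(xpath string) string {
		if xpath == ".//h3/text()" {
			return "Fallback Product"
		}
		return ""
	}
	fmt.Println(firstNonEmpty(eval, ".//h2/text()", []string{".//h3/text()", ".//h1/text()"}))
}
```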

Pagination

Auto-follow pagination or extract URLs for manual control:

Next-Link Pagination (follow "Next" buttons):

{
  "container": "//div[@class='product']",
  "fields": {
    "name": {"xpath": ".//h2/text()"}
  },
  "pagination": {
    "type": "next-link",
    "nextSelector": "//a[@rel='next']/@href",
    "altSelectors": ["//a[contains(text(), 'Next')]/@href"],
    "maxPages": 50
  }
}

Numbered Pagination (extract all page links):

{
  "pagination": {
    "type": "numbered",
    "pageSelector": "//div[@class='pagination']//a/@href",
    "maxPages": 20
  }
}

Usage:

// Auto-follow: returns combined results from all pages
products, _ := gtmlp.ScrapeURL[Product](ctx, url, config)

// Page-separated: get results per page with metadata
results, _ := gtmlp.ScrapeURLWithPages[Product](ctx, url, config)

// Extract-only: get URLs for manual control
info, _ := gtmlp.ExtractPaginationURLs(ctx, url, config)

Data Transformation Pipes

Transform extracted data using pipes:

{
  "container": "//div[@class='product']",
  "fields": {
    "name": {"xpath": ".//h2/text()", "pipes": ["trim"]},
    "price": {"xpath": ".//span[@class='price']/text()", "pipes": ["trim", "tofloat"]},
    "url": {"xpath": ".//a/@href", "pipes": ["parseurl"]}
  }
}

Built-in pipes:

  • trim - Remove whitespace
  • toint - Convert to integer (strips $, ,)
  • tofloat - Convert to float (strips $, ,)
  • parseurl - Convert relative URLs to absolute
  • parsetime:layout:timezone - Parse datetime
  • regexreplace:pattern:replacement:flags - Regex substitution
  • humanduration - Convert seconds to "X minutes ago"
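As an illustration of the tofloat contract described above (strip "$" and "," before parsing), here is a standalone sketch — not the library's source:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tofloat sketches the documented behavior of the built-in "tofloat" pipe:
// trim whitespace, strip "$" and ",", then parse as float64.
func tofloat(input string) (float64, error) {
	cleaned := strings.NewReplacer("$", "", ",", "").Replace(strings.TrimSpace(input))
	return strconv.ParseFloat(cleaned, 64)
}

func main() {
	v, _ := tofloat(" $1,299.99 ")
	fmt.Println(v) // 1299.99
}
```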

Custom pipes:

gtmlp.RegisterPipe("uppercase", func(ctx context.Context, input string, params []string) (any, error) {
    return strings.ToUpper(input), nil
})
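Once registered, the pipe can be referenced by name in config alongside the built-ins, e.g.:

```json
{
  "fields": {
    "name": {"xpath": ".//h2/text()", "pipes": ["trim", "uppercase"]}
  }
}
```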

See docs/API_V2.md for complete pipe documentation.

Documentation & Examples

  • API_V2.md - Complete API reference
  • examples/v2/ - 10 working examples:
    • Basic scraping (JSON/YAML, embed)
    • E-commerce and tables
    • Pagination (next-link, numbered)

License

MIT

Documentation

Constants

const (
	DefaultMaxPages          = 100
	DefaultPaginationTimeout = 10 * time.Minute
)

Default pagination configuration

Variables

var DefaultEnvMapping = &EnvMapping{
	Timeout:    "GTMLP_TIMEOUT",
	UserAgent:  "GTMLP_USER_AGENT",
	RandomUA:   "GTMLP_RANDOM_UA",
	MaxRetries: "GTMLP_MAX_RETRIES",
	Proxy:      "GTMLP_PROXY",
}

DefaultEnvMapping provides default env var names

Functions

func GetLogger

func GetLogger() *slog.Logger

GetLogger returns the current global logger. Useful for testing and debugging.

func Is

func Is(err error, errorType ErrorType) bool

Is reports whether an error is of the given type.

func RegisterPipe

func RegisterPipe(name string, fn PipeFunc)

RegisterPipe registers a custom pipe function

func Scrape

func Scrape[T any](ctx context.Context, html string, config *Config) ([]T, error)

Scrape extracts data from HTML using XPath with a typed result. It finds all container nodes and extracts fields from each one. Returns an empty slice if no containers are found.

func ScrapeURL

func ScrapeURL[T any](ctx context.Context, url string, config *Config) ([]T, error)

ScrapeURL fetches a URL and scrapes it with config (typed)

func ScrapeURLUntyped

func ScrapeURLUntyped(ctx context.Context, url string, config *Config) ([]map[string]any, error)

ScrapeURLUntyped fetches a URL and scrapes it, returning maps (no type parameter)

func ScrapeUntyped

func ScrapeUntyped(ctx context.Context, html string, config *Config) ([]map[string]any, error)

ScrapeUntyped extracts data from HTML using XPath, returning map slices. It finds all container nodes and extracts fields from each one. Returns an empty slice if no containers are found.

func SetLogLevel

func SetLogLevel(level slog.Level)

SetLogLevel changes the global log level by creating a new default handler. Available levels: slog.LevelDebug, slog.LevelInfo, slog.LevelWarn, slog.LevelError.

Default: slog.LevelWarn (production-safe)

Note: This recreates the handler with default settings (TextHandler to stderr). If you're using a custom handler (custom writer, JSON format, etc.), use SetLogger instead.

Example:

// Development: enable Info logs
gtmlp.SetLogLevel(slog.LevelInfo)

// Troubleshooting: enable Debug logs
gtmlp.SetLogLevel(slog.LevelDebug)

// Production: use default Warn level (no call needed)

// For custom handlers, use SetLogger:
handler := slog.NewJSONHandler(myWriter, &slog.HandlerOptions{Level: slog.LevelDebug})
gtmlp.SetLogger(slog.New(handler))

func SetLogger

func SetLogger(logger *slog.Logger)

SetLogger configures the global logger. Use it to customize the handler (JSON vs. text, output destination, etc.).

Example:

handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
    Level: slog.LevelInfo,
})
gtmlp.SetLogger(slog.New(handler))

func ValidateXPath

func ValidateXPath(html string, xpaths map[string]string) map[string]ValidationResult

ValidateXPath validates XPath expressions against HTML

func ValidateXPathURL

func ValidateXPathURL(url string, config *Config) (map[string]ValidationResult, error)

ValidateXPathURL validates XPath expressions from a URL

func WithURL

func WithURL(ctx context.Context, url string) context.Context

WithURL adds the base URL to the context for the parseurl pipe.

Types

type Config

type Config struct {
	// XPath definitions
	Container    string                 // Repeating element selector
	AltContainer []string               // Alternative container selectors
	Fields       map[string]FieldConfig // Field name → FieldConfig

	// Pagination
	Pagination *PaginationConfig // Optional pagination configuration

	// Security options
	URLValidator    func(string) error // Optional custom URL validation function
	AllowPrivateIPs bool               // Allow scraping private/internal IPs (default: false)

	// HTTP options
	Timeout    time.Duration
	UserAgent  string
	RandomUA   bool
	MaxRetries int
	Proxy      string
	Headers    map[string]string
}

Config holds scraping configuration

func LoadConfig

func LoadConfig(path string, envMapping *EnvMapping) (*Config, error)

LoadConfig loads selector config from file (JSON/YAML auto-detected)

func ParseConfig

func ParseConfig(data string, format ConfigFormat, envMapping *EnvMapping) (*Config, error)

ParseConfig parses config from string

func (*Config) Validate

func (c *Config) Validate() error

Validate validates the config

type ConfigFormat

type ConfigFormat string

ConfigFormat specifies file format

const (
	FormatJSON ConfigFormat = "json"
	FormatYAML ConfigFormat = "yaml"
)

type EnvMapping

type EnvMapping struct {
	Timeout    string
	UserAgent  string
	RandomUA   string
	MaxRetries string
	Proxy      string
}

EnvMapping defines configurable environment variable names

type ErrorType

type ErrorType string

ErrorType represents the category of error

const (
	ErrTypeNetwork    ErrorType = "network"
	ErrTypeParsing    ErrorType = "parsing"
	ErrTypeXPath      ErrorType = "xpath"
	ErrTypeConfig     ErrorType = "config"
	ErrTypeValidation ErrorType = "validation"
	ErrTypePipe       ErrorType = "pipe"
)

type FieldConfig

type FieldConfig struct {
	XPath    string
	AltXPath []string
	Pipes    []string
}

FieldConfig defines a single field's XPath and optional pipes

type HealthCheckResult

type HealthCheckResult struct {
	URL     string        // The URL that was checked
	Status  HealthStatus  // The health status of the URL
	Code    int           // HTTP status code (0 if error occurred)
	Latency time.Duration // Time taken for the health check
	Error   error         // Error message if check failed
}

HealthCheckResult represents the result of a health check

func CheckHealth

func CheckHealth(url string) HealthCheckResult

CheckHealth performs a health check on a single URL

func CheckHealthMulti

func CheckHealthMulti(urls []string) []HealthCheckResult

CheckHealthMulti performs health checks on multiple URLs concurrently

func CheckHealthWithOptions

func CheckHealthWithOptions(url string, config *Config) HealthCheckResult

CheckHealthWithOptions performs a health check on a single URL with custom configuration

type HealthStatus

type HealthStatus int

HealthStatus represents the health status of a URL

const (
	// StatusHealthy indicates the URL returned a 2xx status code
	StatusHealthy HealthStatus = iota
	// StatusUnhealthy indicates the URL returned a 4xx or 5xx status code
	StatusUnhealthy
	// StatusError indicates there was a network or other error
	StatusError
)

func (HealthStatus) String

func (s HealthStatus) String() string

String returns the string representation of HealthStatus

type PageResult

type PageResult[T any] struct {
	URL       string
	PageNum   int
	Items     []T
	ScrapedAt time.Time
}

PageResult contains results from a single page

type PaginatedResults

type PaginatedResults[T any] struct {
	Pages      []PageResult[T]
	TotalPages int
	TotalItems int
}

PaginatedResults contains page-separated scraping results

func ScrapeURLWithPages

func ScrapeURLWithPages[T any](ctx context.Context, url string, config *Config) (*PaginatedResults[T], error)

ScrapeURLWithPages fetches a URL and scrapes it with pagination, returning page-separated results

type PaginationConfig

type PaginationConfig struct {
	Type         string        // "next-link" or "numbered"
	NextSelector string        // XPath for next link (next-link type)
	AltSelectors []string      // Fallback selectors for next link
	PageSelector string        // XPath for all page links (numbered type)
	Pipes        []string      // URL transformation pipes
	MaxPages     int           // Maximum pages to scrape (default: 100)
	Timeout      time.Duration // Total pagination timeout (default: 10m)
}

PaginationConfig defines pagination behavior

type PaginationError

type PaginationError struct {
	PageURL      string // URL that failed
	PageNumber   int    // Page number (1-indexed)
	PartialData  any    // Items scraped before failure
	TotalScraped int    // Total items before failure
	Cause        error  // Underlying error
}

PaginationError represents an error during pagination

func (*PaginationError) Error

func (e *PaginationError) Error() string

type PaginationInfo

type PaginationInfo struct {
	URLs    []string // All discovered page URLs
	Type    string   // "next-link" or "numbered"
	BaseURL string   // Original base URL
}

PaginationInfo contains extracted pagination URLs

func ExtractPaginationURLs

func ExtractPaginationURLs(ctx context.Context, url string, config *Config) (*PaginationInfo, error)

ExtractPaginationURLs extracts all pagination URLs without scraping
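A typical manual-control loop, assuming only the functions documented here (error handling abbreviated; note that if config still has Pagination set, ScrapeURL will auto-follow pagination from each page, so you may want a copy of the config with Pagination cleared):

```go
info, err := gtmlp.ExtractPaginationURLs(ctx, startURL, config)
if err != nil {
	log.Fatal(err)
}

pageConfig := *config
pageConfig.Pagination = nil // scrape each page individually

for _, pageURL := range info.URLs {
	// Rate-limit, filter, or deduplicate pages here as needed.
	items, err := gtmlp.ScrapeURL[Product](ctx, pageURL, &pageConfig)
	if err != nil {
		log.Printf("skip %s: %v", pageURL, err)
		continue
	}
	process(items) // your own handling per page
}
```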

type PartialResult

type PartialResult[T any] struct {
	Data   []T
	Errors map[string]error
}

PartialResult contains data and field-level errors

type PipeError

type PipeError struct {
	PipeName string
	Input    string
	Params   []string
	Cause    error
}

PipeError represents an error that occurred during pipe transformation

func (*PipeError) Error

func (e *PipeError) Error() string

func (*PipeError) Unwrap

func (e *PipeError) Unwrap() error

type PipeFunc

type PipeFunc func(ctx context.Context, input string, params []string) (any, error)

PipeFunc defines a pipe transformation function

type ScrapeError

type ScrapeError struct {
	Type    ErrorType
	Message string
	XPath   string
	URL     string
	Cause   error
}

ScrapeError is a typed error with context

func (*ScrapeError) Error

func (e *ScrapeError) Error() string

func (*ScrapeError) Unwrap

func (e *ScrapeError) Unwrap() error

type ValidationResult

type ValidationResult struct {
	XPath      string
	Valid      bool
	MatchCount int
	Error      error
}

ValidationResult represents XPath validation result for config-based validation

Directories

Path Synopsis
examples
v2/basic_json command
v2/basic_yaml command
v2/embed_json command
v2/embed_yaml command
v2/table_json command
v2/table_yaml command
