gtmlp

package module
Published: Feb 5, 2026 License: MIT Imports: 20 Imported by: 0

README

GTMLP

Type-safe HTML scraping with XPath selectors and external configuration.

Features

  • Type-safe with Go generics
  • External config (JSON/YAML)
  • XPath validation before scraping
  • Pagination support - Auto-follow next-link or numbered pagination
  • Fallback XPath chains (altXpath, altContainer) for handling varying HTML structures
  • Data transformation pipes (trim, int/float conversion, regex, URL parsing, etc.)
  • Custom pipe registration for domain-specific transformations
  • Structured logging with configurable levels (slog-based)
  • SSRF protection - Blocks private IPs by default
  • Health checks for URLs
  • Production-ready (retries, proxy, timeouts)

Installation

go get github.com/Hanivan/gtmlp

Quick Start

selectors.json:

{
  "container": "//div[@class='product']",
  "fields": {
    "name": {"xpath": ".//h2/text()"},
    "price": {"xpath": ".//span[@class='price']/text()", "pipes": ["trim", "tofloat"]}
  }
}

main.go:

type Product struct {
    Name  string  `json:"name"`
    Price float64 `json:"price"`
}

config, _ := gtmlp.LoadConfig("selectors.json", nil)
products, _ := gtmlp.ScrapeURL[Product](context.Background(), "https://example.com", config)

for _, p := range products {
    fmt.Printf("%s: %.2f\n", p.Name, p.Price)
}

Or embed config with go:embed:

//go:embed selectors.yaml
var configYAML string

config, _ := gtmlp.ParseConfig(configYAML, gtmlp.FormatYAML, nil)
products, _ := gtmlp.ScrapeURL[Product](context.Background(), "https://example.com", config)

Logging

Configure log levels for different environments:

import "log/slog"

// Development: see HTTP requests and scraping details
gtmlp.SetLogLevel(slog.LevelInfo)

// Troubleshooting: see XPath evaluation and fallbacks
gtmlp.SetLogLevel(slog.LevelDebug)

// Production: warnings and errors only (default)
gtmlp.SetLogLevel(slog.LevelWarn)

// Custom handler (JSON format, custom writer, etc.)
handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo})
gtmlp.SetLogger(slog.New(handler))

Log levels:

  • Debug - XPath evaluation, fallback usage, field extraction
  • Info - HTTP requests, scraping progress, pagination
  • Warn - Fallback usage, HTTP warnings, duplicate URLs (default)
  • Error - HTTP failures, parsing errors, validation failures

Security

GTMLP includes built-in SSRF (Server-Side Request Forgery) protection:

config := &gtmlp.Config{
    Container: "//div[@class='product']",
    Fields:    fields,

    // SSRF protection (default: enabled)
    // Blocks: localhost, 127.0.0.1, 10.x.x.x, 192.168.x.x, 169.254.169.254
    AllowPrivateIPs: false, // set true to allow private IPs

    // Custom URL validator
    URLValidator: func(url string) error {
        if !strings.Contains(url, "example.com") {
            return errors.New("domain not allowed")
        }
        return nil
    },
}

See SECURITY.md for security best practices.

Usage

// Load config from file
config, _ := gtmlp.LoadConfig("selectors.yaml", nil)

// Or embed with go:embed
//go:embed selectors.yaml
var configYAML string
config, _ := gtmlp.ParseConfig(configYAML, gtmlp.FormatYAML, nil)

// Scrape
products, _ := gtmlp.ScrapeURL[Product](context.Background(), url, config)
results, _ := gtmlp.ScrapeURLUntyped(context.Background(), url, config) // returns []map[string]any

Environment Variables

export GTMLP_TIMEOUT=30s
export GTMLP_USER_AGENT=MyBot/1.0
export GTMLP_PROXY=http://proxy:8080

Fallback XPath Chains

Handle varying HTML structures with altXpath and altContainer:

{
  "container": "//div[@class='product']",
  "altContainer": ["//article[@class='product']", "//div[@class='item']"],
  "fields": {
    "name": {
      "xpath": ".//h2/text()",
      "altXpath": [".//h3/text()", ".//h1/text()"]
    },
    "price": {
      "xpath": ".//span[@class='price']/text()",
      "altXpath": [".//div[@class='price']/text()"],
      "pipes": ["trim", "tofloat"]
    }
  }
}

How it works:

  • Tries primary XPath first
  • If empty (after pipes), tries each altXpath in order
  • Returns first non-empty result
  • Container fallback works the same way with altContainer
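The steps above can be sketched as follows. Here eval is a stand-in for real XPath evaluation against a container node; it is a placeholder, not part of the gtmlp API:

```go
package main

import (
	"fmt"
	"strings"
)

// firstNonEmpty mirrors the documented fallback behavior: try the primary
// XPath, then each alternative in order, returning the first result that
// is non-empty after trimming.
func firstNonEmpty(eval func(xpath string) string, primary string, alts []string) string {
	for _, xp := range append([]string{primary}, alts...) {
		if v := strings.TrimSpace(eval(xp)); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	// Simulated page where only the <h3> selector matches.
	eval := func(xpath string) string {
		if xpath == ".//h3/text()" {
			return "Fallback Product"
		}
		return ""
	}
	fmt.Println(firstNonEmpty(eval, ".//h2/text()", []string{".//h3/text()", ".//h1/text()"}))
}
```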

Pagination

Auto-follow pagination or extract URLs for manual control:

Next-Link Pagination (follow "Next" buttons):

{
  "container": "//div[@class='product']",
  "fields": {
    "name": {"xpath": ".//h2/text()"}
  },
  "pagination": {
    "type": "next-link",
    "nextSelector": "//a[@rel='next']/@href",
    "altSelectors": ["//a[contains(text(), 'Next')]/@href"],
    "maxPages": 50
  }
}

Numbered Pagination (extract all page links):

{
  "pagination": {
    "type": "numbered",
    "pageSelector": "//div[@class='pagination']//a/@href",
    "maxPages": 20
  }
}

Usage:

// Auto-follow: returns combined results from all pages
products, _ := gtmlp.ScrapeURL[Product](ctx, url, config)

// Page-separated: get results per page with metadata
results, _ := gtmlp.ScrapeURLWithPages[Product](ctx, url, config)

// Extract-only: get URLs for manual control
info, _ := gtmlp.ExtractPaginationURLs(ctx, url, config)

Data Transformation Pipes

Transform extracted data using pipes:

{
  "container": "//div[@class='product']",
  "fields": {
    "name": {"xpath": ".//h2/text()", "pipes": ["trim"]},
    "price": {"xpath": ".//span[@class='price']/text()", "pipes": ["trim", "tofloat"]},
    "url": {"xpath": ".//a/@href", "pipes": ["parseurl"]}
  }
}

Built-in pipes:

  • trim - Remove whitespace
  • toint - Convert to integer (strips $, ,)
  • tofloat - Convert to float (strips $, ,)
  • parseurl - Convert relative URLs to absolute
  • parsetime:layout:timezone - Parse datetime
  • regexreplace:pattern:replacement:flags - Regex substitution
  • humanduration - Convert seconds to "X minutes ago"
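As an illustration of the tofloat contract described above (strip "$" and "," before parsing), here is a standalone sketch — not the library's source:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tofloat sketches the documented behavior of the built-in "tofloat" pipe:
// trim whitespace, strip "$" and ",", then parse as float64.
func tofloat(input string) (float64, error) {
	cleaned := strings.NewReplacer("$", "", ",", "").Replace(strings.TrimSpace(input))
	return strconv.ParseFloat(cleaned, 64)
}

func main() {
	v, _ := tofloat(" $1,299.99 ")
	fmt.Println(v) // 1299.99
}
```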

Custom pipes:

gtmlp.RegisterPipe("uppercase", func(ctx context.Context, input string, params []string) (any, error) {
    return strings.ToUpper(input), nil
})
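Once registered, the pipe can be referenced by name in config alongside the built-ins, e.g.:

```json
{
  "fields": {
    "name": {"xpath": ".//h2/text()", "pipes": ["trim", "uppercase"]}
  }
}
```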

See docs/API_V2.md for complete pipe documentation.

Documentation & Examples

  • API_V2.md - Complete API reference
  • examples/v2/ - 10 working examples:
    • Basic scraping (JSON/YAML, embed)
    • E-commerce and tables
    • Pagination (next-link, numbered)

License

MIT

Documentation

Constants

const (
	DefaultMaxPages          = 100
	DefaultPaginationTimeout = 10 * time.Minute
)

Default pagination configuration

Variables

var DefaultEnvMapping = &EnvMapping{
	Timeout:    "GTMLP_TIMEOUT",
	UserAgent:  "GTMLP_USER_AGENT",
	RandomUA:   "GTMLP_RANDOM_UA",
	MaxRetries: "GTMLP_MAX_RETRIES",
	Proxy:      "GTMLP_PROXY",
}

DefaultEnvMapping provides default env var names

Functions

func GetLogger

func GetLogger() *slog.Logger

GetLogger returns the current global logger. Useful for testing and debugging.

func Is

func Is(err error, errorType ErrorType) bool

Is reports whether an error is of the given type.

func RegisterPipe

func RegisterPipe(name string, fn PipeFunc)

RegisterPipe registers a custom pipe function

func Scrape

func Scrape[T any](ctx context.Context, html string, config *Config) ([]T, error)

Scrape extracts data from HTML using XPath with a typed result. It finds all container nodes and extracts fields from each one. Returns an empty slice if no containers are found.

func ScrapeURL

func ScrapeURL[T any](ctx context.Context, url string, config *Config) ([]T, error)

ScrapeURL fetches a URL and scrapes it with config (typed)

func ScrapeURLUntyped

func ScrapeURLUntyped(ctx context.Context, url string, config *Config) ([]map[string]any, error)

ScrapeURLUntyped fetches a URL and scrapes it, returning maps (no type parameter)

func ScrapeUntyped

func ScrapeUntyped(ctx context.Context, html string, config *Config) ([]map[string]any, error)

ScrapeUntyped extracts data from HTML using XPath, returning map slices. It finds all container nodes and extracts fields from each one. Returns an empty slice if no containers are found.

func SetLogLevel

func SetLogLevel(level slog.Level)

SetLogLevel changes the global log level by creating a new default handler. Available levels: slog.LevelDebug, slog.LevelInfo, slog.LevelWarn, slog.LevelError.

Default: slog.LevelWarn (production-safe)

Note: This recreates the handler with default settings (TextHandler to stderr). If you're using a custom handler (custom writer, JSON format, etc.), use SetLogger instead.

Example:

// Development: enable Info logs
gtmlp.SetLogLevel(slog.LevelInfo)

// Troubleshooting: enable Debug logs
gtmlp.SetLogLevel(slog.LevelDebug)

// Production: use default Warn level (no call needed)

// For custom handlers, use SetLogger:
handler := slog.NewJSONHandler(myWriter, &slog.HandlerOptions{Level: slog.LevelDebug})
gtmlp.SetLogger(slog.New(handler))

func SetLogger

func SetLogger(logger *slog.Logger)

SetLogger configures the global logger. Use it to customize the handler (JSON vs. text, output destination, etc.).

Example:

handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
    Level: slog.LevelInfo,
})
gtmlp.SetLogger(slog.New(handler))

func ValidateXPath

func ValidateXPath(html string, xpaths map[string]string) map[string]ValidationResult

ValidateXPath validates XPath expressions against HTML

func ValidateXPathURL

func ValidateXPathURL(url string, config *Config) (map[string]ValidationResult, error)

ValidateXPathURL validates XPath expressions from a URL

func WithURL

func WithURL(ctx context.Context, url string) context.Context

WithURL adds the base URL to the context for the parseurl pipe.

Types

type Config

type Config struct {
	// XPath definitions
	Container    string                 // Repeating element selector
	AltContainer []string               // Alternative container selectors
	Fields       map[string]FieldConfig // Field name → FieldConfig

	// Pagination
	Pagination *PaginationConfig // Optional pagination configuration

	// Security options
	URLValidator    func(string) error // Optional custom URL validation function
	AllowPrivateIPs bool               // Allow scraping private/internal IPs (default: false)

	// HTTP options
	Timeout    time.Duration
	UserAgent  string
	RandomUA   bool
	MaxRetries int
	Proxy      string
	Headers    map[string]string
}

Config holds scraping configuration

func LoadConfig

func LoadConfig(path string, envMapping *EnvMapping) (*Config, error)

LoadConfig loads selector config from file (JSON/YAML auto-detected)

func ParseConfig

func ParseConfig(data string, format ConfigFormat, envMapping *EnvMapping) (*Config, error)

ParseConfig parses config from string

func (*Config) Validate

func (c *Config) Validate() error

Validate validates the config

type ConfigFormat

type ConfigFormat string

ConfigFormat specifies file format

const (
	FormatJSON ConfigFormat = "json"
	FormatYAML ConfigFormat = "yaml"
)

type EnvMapping

type EnvMapping struct {
	Timeout    string
	UserAgent  string
	RandomUA   string
	MaxRetries string
	Proxy      string
}

EnvMapping defines configurable environment variable names

type ErrorType

type ErrorType string

ErrorType represents the category of error

const (
	ErrTypeNetwork    ErrorType = "network"
	ErrTypeParsing    ErrorType = "parsing"
	ErrTypeXPath      ErrorType = "xpath"
	ErrTypeConfig     ErrorType = "config"
	ErrTypeValidation ErrorType = "validation"
	ErrTypePipe       ErrorType = "pipe"
)

type FieldConfig

type FieldConfig struct {
	XPath    string
	AltXPath []string
	Pipes    []string
}

FieldConfig defines a single field's XPath and optional pipes

type HealthCheckResult

type HealthCheckResult struct {
	URL     string        // The URL that was checked
	Status  HealthStatus  // The health status of the URL
	Code    int           // HTTP status code (0 if error occurred)
	Latency time.Duration // Time taken for the health check
	Error   error         // Error message if check failed
}

HealthCheckResult represents the result of a health check

func CheckHealth

func CheckHealth(url string) HealthCheckResult

CheckHealth performs a health check on a single URL

func CheckHealthMulti

func CheckHealthMulti(urls []string) []HealthCheckResult

CheckHealthMulti performs health checks on multiple URLs concurrently

func CheckHealthWithOptions

func CheckHealthWithOptions(url string, config *Config) HealthCheckResult

CheckHealthWithOptions performs a health check on a single URL with custom configuration

type HealthStatus

type HealthStatus int

HealthStatus represents the health status of a URL

const (
	// StatusHealthy indicates the URL returned a 2xx status code
	StatusHealthy HealthStatus = iota
	// StatusUnhealthy indicates the URL returned a 4xx or 5xx status code
	StatusUnhealthy
	// StatusError indicates there was a network or other error
	StatusError
)

func (HealthStatus) String

func (s HealthStatus) String() string

String returns the string representation of HealthStatus

type PageResult

type PageResult[T any] struct {
	URL       string
	PageNum   int
	Items     []T
	ScrapedAt time.Time
}

PageResult contains results from a single page

type PaginatedResults

type PaginatedResults[T any] struct {
	Pages      []PageResult[T]
	TotalPages int
	TotalItems int
}

PaginatedResults contains page-separated scraping results

func ScrapeURLWithPages

func ScrapeURLWithPages[T any](ctx context.Context, url string, config *Config) (*PaginatedResults[T], error)

ScrapeURLWithPages fetches a URL and scrapes it with pagination, returning page-separated results

type PaginationConfig

type PaginationConfig struct {
	Type         string        // "next-link" or "numbered"
	NextSelector string        // XPath for next link (next-link type)
	AltSelectors []string      // Fallback selectors for next link
	PageSelector string        // XPath for all page links (numbered type)
	Pipes        []string      // URL transformation pipes
	MaxPages     int           // Maximum pages to scrape (default: 100)
	Timeout      time.Duration // Total pagination timeout (default: 10m)
}

PaginationConfig defines pagination behavior

type PaginationError

type PaginationError struct {
	PageURL      string // URL that failed
	PageNumber   int    // Page number (1-indexed)
	PartialData  any    // Items scraped before failure
	TotalScraped int    // Total items before failure
	Cause        error  // Underlying error
}

PaginationError represents an error during pagination

func (*PaginationError) Error

func (e *PaginationError) Error() string

type PaginationInfo

type PaginationInfo struct {
	URLs    []string // All discovered page URLs
	Type    string   // "next-link" or "numbered"
	BaseURL string   // Original base URL
}

PaginationInfo contains extracted pagination URLs

func ExtractPaginationURLs

func ExtractPaginationURLs(ctx context.Context, url string, config *Config) (*PaginationInfo, error)

ExtractPaginationURLs extracts all pagination URLs without scraping
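A typical manual-control loop, assuming only the functions documented here (error handling abbreviated; note that if config still has Pagination set, ScrapeURL will auto-follow pagination from each page, so you may want a copy of the config with Pagination cleared):

```go
info, err := gtmlp.ExtractPaginationURLs(ctx, startURL, config)
if err != nil {
	log.Fatal(err)
}

pageConfig := *config
pageConfig.Pagination = nil // scrape each page individually

for _, pageURL := range info.URLs {
	// Rate-limit, filter, or deduplicate pages here as needed.
	items, err := gtmlp.ScrapeURL[Product](ctx, pageURL, &pageConfig)
	if err != nil {
		log.Printf("skip %s: %v", pageURL, err)
		continue
	}
	process(items) // your own handling per page
}
```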

type PartialResult

type PartialResult[T any] struct {
	Data   []T
	Errors map[string]error
}

PartialResult contains data and field-level errors

type PipeError

type PipeError struct {
	PipeName string
	Input    string
	Params   []string
	Cause    error
}

PipeError represents an error that occurred during pipe transformation

func (*PipeError) Error

func (e *PipeError) Error() string

func (*PipeError) Unwrap

func (e *PipeError) Unwrap() error

type PipeFunc

type PipeFunc func(ctx context.Context, input string, params []string) (any, error)

PipeFunc defines a pipe transformation function

type ScrapeError

type ScrapeError struct {
	Type    ErrorType
	Message string
	XPath   string
	URL     string
	Cause   error
}

ScrapeError is a typed error with context

func (*ScrapeError) Error

func (e *ScrapeError) Error() string

func (*ScrapeError) Unwrap

func (e *ScrapeError) Unwrap() error

type ValidationResult

type ValidationResult struct {
	XPath      string
	Valid      bool
	MatchCount int
	Error      error
}

ValidationResult represents XPath validation result for config-based validation

Directories

Path Synopsis
examples
v2/basic_json command
v2/basic_yaml command
v2/embed_json command
v2/embed_yaml command
v2/table_json command
v2/table_yaml command
