Documentation
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type Channels
Channels is a map of Page channels keyed by HTTP response code, so that different behavior can be defined for different response codes.
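The declaration of Channels is not shown on this page. As a minimal sketch of the idea, assuming the underlying type is a map from HTTP status code to a channel of *Page, routing a page by its response code might look like the following. The Page fields, the map type, and the dispatch helper are all assumptions made for illustration, not the package's actual API.

```go
package main

import "fmt"

// Page is a hypothetical stand-in for the package's Page type;
// its real fields are not documented in this section.
type Page struct {
	URL        string
	StatusCode int
}

// Channels is assumed here to be a map from HTTP response code to a
// channel of pages. The real declaration may differ.
type Channels map[int]chan *Page

// dispatch is a hypothetical helper: it sends the page to the channel
// registered for its status code and reports whether one was found.
func dispatch(chans Channels, p *Page) bool {
	ch, ok := chans[p.StatusCode]
	if !ok {
		return false
	}
	ch <- p
	return true
}

func main() {
	// Buffered channels so the sends below do not block without a reader.
	chans := Channels{
		200: make(chan *Page, 1),
		404: make(chan *Page, 1),
	}

	dispatch(chans, &Page{URL: "https://example.com/", StatusCode: 200})
	dispatch(chans, &Page{URL: "https://example.com/missing", StatusCode: 404})

	fmt.Println("ok:", (<-chans[200]).URL)
	fmt.Println("not found:", (<-chans[404]).URL)
}
```

Keying by status code lets a consumer read only the channels it cares about, for example processing 200 responses while logging 404s separately.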
type Config (added in v0.1.6)
type Config struct {
	StartURL            string
	AllowedDomains      []string      // Domains to stay within
	UserAgents          []string
	CrawlDelay          time.Duration // Delay between requests to the same domain
	MaxDepth            int           // Maximum crawl depth
	MaxRetries          int           // Max retries for a failed request
	RequestTimeout      time.Duration
	QueueIdleTimeout    time.Duration
	ProxyURL            string // e.g., "http://user:pass@host:port"
	RobotsUserAgent     string // User agent to use for robots.txt checks
	ConcurrentRequests  int    // Number of concurrent fetch workers
	Channels            Channels
	Headers             map[string]string
	LanguageCode        string
	Filters             []func(*Page, *Config) bool
	MaxIdleConnsPerHost int
	MaxIdleConns        int
	Proxies             []string
	RequireHeadless     bool
	ProcessSitemaps     bool
}
Config holds the crawler configuration.
type Crawler
type Crawler struct {
	// contains filtered or unexported fields
}
Crawler represents the web crawler.
func NewCrawler
func NewCrawler(config Config, queue queue.QueueInterface) (*Crawler, error)
NewCrawler initializes a new Crawler from the given Config and queue.
type Headless (added in v0.7.0)
type Headless struct {
}
func NewHeadless (added in v0.7.0)
func NewHeadless() *Headless