Documentation ¶
Index ¶
- Variables
- func Attr(n Finder, attr, selector string) string
- func Callbacks(v ...string) []string
- func ConstructHTTPRequest(req *Request) (r *http.Request, err error)
- func FindAny(finder Finder, selectors ...string) (node *goquery.Selection)
- func NodeAttr(attr string) func(int, *goquery.Selection) string
- func NodeResolveURL(resp *Response) func(int, *goquery.Selection) string
- func NodeText(_ int, n *goquery.Selection) string
- func ParseFloat(n Finder, selector string) (res float64, err error)
- func ParseUint(n Finder, selector string) (res uint64, err error)
- func ProxyFromContext(ctx context.Context) (addrs []string, ok bool)
- func Text(n Finder, selector string) string
- func WithProxy(ctx context.Context, addrs ...string) context.Context
- func WriteResponseFile(r *Response, fname string) (err error)
- type Crawler
- type Finder
- type Handler
- type Job
- type Middleware
- type Option
- func WithConcurrency(n int) Option
- func WithDefaultHeaders(headers map[string]string) Option
- func WithDefaultTimeout(d time.Duration) Option
- func WithQueue(queue Queue) Option
- func WithQueueCapacity(n int) Option
- func WithSpiders(spiders ...func(Crawler)) Option
- func WithTransport(transport *http.Transport) Option
- func WithUserAgent(ua string) Option
- type Queue
- type Request
- type RequestError
- type Response
- func (r *Response) Bytes() (body []byte, err error)
- func (r *Response) Close() error
- func (r *Response) Find(selector string) *goquery.Selection
- func (r *Response) ParseHTML() (err error)
- func (r *Response) Query() *goquery.Document
- func (r *Response) Status() string
- func (r *Response) URL() *url.URL
Constants ¶
This section is empty.
Variables ¶
var DefaultHeaders = map[string]string{
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.8",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36",
}
DefaultHeaders - Default crawler headers.
var NodeDataPhoto = NodeAttr("data-photo")
NodeDataPhoto - Node "data-photo" attribute selector.
var NodeHref = NodeAttr("href")
NodeHref - Node "href" attribute selector.
var NodeSrc = NodeAttr("src")
NodeSrc - Node "src" attribute selector.
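NodeAttr and the selectors derived from it return functions with the signature func(int, *goquery.Selection) string, which matches the callback accepted by goquery's (*Selection).Map. A minimal sketch of collecting every link href from a parsed response; the crawl import path is assumed, not stated on this page:

import (
	crawl "github.com/crackcomm/crawl" // assumed import path
)

// collectHrefs returns the href attribute of every matching anchor.
func collectHrefs(resp *crawl.Response) []string {
	// NodeHref is NodeAttr("href"); it plugs straight into goquery's Map.
	return resp.Find("a[href]").Map(crawl.NodeHref)
}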
Functions ¶
func ConstructHTTPRequest ¶
ConstructHTTPRequest - Constructs an http.Request structure.
func NodeResolveURL ¶
NodeResolveURL - Returns a selector which takes an href and resolves its URL. Returns a helper for (*goquery.Selection).Each().
func ParseFloat ¶
ParseFloat - Finds a node in the response and parses its text as a float64. When the text is not found, it returns 0.0 and a nil error. The returned error originates from strconv.ParseFloat.
func ParseUint ¶
ParseUint - Finds a node in the response and parses its text as a uint64. When the text is not found, it returns 0 and a nil error. The returned error originates from strconv.ParseUint.
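A short sketch of both parsers inside a scraping helper. The selectors are hypothetical, and *Response is assumed to satisfy the Finder interface since it provides the Find method listed above; the crawl import path is assumed:

import (
	crawl "github.com/crackcomm/crawl" // assumed import path
)

// productNumbers extracts a price and a stock count from a product page.
// Both helpers return the zero value with a nil error when the selector matches nothing.
func productNumbers(resp *crawl.Response) (price float64, stock uint64, err error) {
	if price, err = crawl.ParseFloat(resp, ".price"); err != nil {
		return 0, 0, err // error originates from strconv.ParseFloat
	}
	if stock, err = crawl.ParseUint(resp, ".stock"); err != nil {
		return 0, 0, err // error originates from strconv.ParseUint
	}
	return price, stock, nil
}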
func ProxyFromContext ¶
ProxyFromContext - Returns proxy addresses from context metadata.
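WithProxy stores proxy addresses in context metadata and ProxyFromContext reads them back; how the crawler applies them when executing requests is not described on this page. A minimal sketch of scheduling a request with proxy metadata attached; the import path and proxy address are assumptions:

import (
	"context"
	"log"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

// scheduleWithProxy attaches a proxy address to the context before scheduling.
func scheduleWithProxy(c crawl.Crawler, req *crawl.Request) error {
	ctx := crawl.WithProxy(context.Background(), "http://127.0.0.1:8118")
	if addrs, ok := crawl.ProxyFromContext(ctx); ok {
		log.Printf("scheduling %q with proxies %v", req.URL, addrs)
	}
	return c.Schedule(ctx, req)
}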
func WriteResponseFile ¶
WriteResponseFile - Reads the response body into memory and writes it to a file.
Types ¶
type Crawler ¶
type Crawler interface {
// Schedule - Schedules request.
// Context is passed to queue in a job.
Schedule(context.Context, *Request) error
// Execute - Makes an HTTP request respecting the context deadline.
// If the request's Raw field is not true, the ParseHTML() method is executed on the Response.
// Then all callbacks are executed with the context.
Execute(context.Context, *Request) (*Response, error)
// Handlers - Returns all registered handlers.
Handlers() map[string][]Handler
// Register - Registers crawl handler.
Register(name string, h Handler)
// Middleware - Registers a middleware.
// Request is not executed if middleware returns an error.
Middleware(Middleware)
// Start - Starts the crawler.
// All errors should be received from Errors() channel.
Start()
// Close - Closes the queue and the crawler.
Close() error
// Errors - Returns a channel that receives all crawl errors.
// Only errors from queued requests are delivered here,
// including queue errors as well as request errors.
Errors() <-chan error
}
Crawler - Crawler interface.
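The Handler type is not expanded on this page; the sketch below assumes it is a function of the form func(context.Context, *Response) error. Under that assumption, a spider is simply a function that registers handlers on the crawler; the selector, callback names, and import path are hypothetical:

import (
	"context"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

// Spider registers handlers; pass it to the crawler with WithSpiders(Spider).
func Spider(c crawl.Crawler) {
	c.Register("example.list", func(ctx context.Context, resp *crawl.Response) error {
		// Resolve every listed link against the response URL and schedule
		// it for the "example.detail" callback.
		for _, u := range resp.Find("a.item").Map(crawl.NodeResolveURL(resp)) {
			req := &crawl.Request{URL: u, Callbacks: crawl.Callbacks("example.detail")}
			if err := c.Schedule(ctx, req); err != nil {
				return err
			}
		}
		return nil
	})
}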
type Job ¶
type Job interface {
// Request - Returns the crawl request.
Request() *Request
// Context - Returns job context.
Context() context.Context
// Done - Sets job as done.
Done()
}
Job - Crawl job interface.
type Middleware ¶
Middleware - Crawler middleware.
type Option ¶
type Option func(*crawl)
Option - Crawl option.
func WithConcurrency ¶
WithConcurrency - Sets crawl concurrency. Default: 1000.
func WithDefaultHeaders ¶
WithDefaultHeaders - Sets crawl default headers. Default: empty.
func WithDefaultTimeout ¶
WithDefaultTimeout - Sets default request timeout duration.
func WithQueue ¶
WithQueue - Sets crawl queue. Default: creates a queue using NewQueue() with the capacity set by WithQueueCapacity().
func WithQueueCapacity ¶
WithQueueCapacity - Sets the queue capacity. It is used when a new in-memory channel queue has to be created, and it also sets the capacity of the errors buffered channel. Default: 10000.
func WithSpiders ¶
WithSpiders - Registers spiders on a crawler.
func WithTransport ¶
WithTransport - Sets crawl HTTP transport.
func WithUserAgent ¶
WithUserAgent - Sets crawl default user-agent.
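A constructor is not listed in this section; the sketch below assumes one named New that accepts these options (the name and import path are hypothetical) and that Errors() should be drained while the crawler runs:

import (
	"log"
	"time"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

func run() {
	c := crawl.New( // hypothetical constructor accepting Option values
		crawl.WithConcurrency(200),
		crawl.WithQueueCapacity(10000),
		crawl.WithDefaultTimeout(30*time.Second),
		crawl.WithUserAgent("my-crawler/1.0"),
		crawl.WithSpiders(Spider), // Spider from the Crawler sketch above
	)
	defer c.Close()

	// Drain crawl errors while the crawler runs.
	go func() {
		for err := range c.Errors() {
			log.Println("crawl error:", err)
		}
	}()

	c.Start()
}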
type Queue ¶
type Queue interface {
// Get - Gets a job from the queue.
// Returns io.EOF if the queue is empty.
Get() (Job, error)
// Schedule - Schedules a Request.
// Returns io.ErrClosedPipe if the queue is closed.
Schedule(context.Context, *Request) error
// Close - Closes the queue.
Close() error
}
Queue - Requests queue.
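The Get/Schedule/Close contract allows a simple worker loop. A sketch of draining a queue with a crawler, assuming each job should be executed and then marked done; the import path is an assumption:

import (
	"io"
	"log"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

// drain executes queued jobs until the queue reports io.EOF.
func drain(c crawl.Crawler, q crawl.Queue) error {
	for {
		job, err := q.Get()
		if err == io.EOF {
			return nil // queue is empty
		}
		if err != nil {
			return err
		}
		resp, err := c.Execute(job.Context(), job.Request())
		if resp != nil {
			resp.Close() // responses always have to be released
		}
		if err != nil {
			log.Println("execute:", err)
		}
		job.Done()
	}
}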
type Request ¶
type Request struct {
// URL - It can be an absolute URL, or relative to the source URL if Referer is set.
URL string `json:"url,omitempty"`
// Method - "GET" by default.
Method string `json:"method,omitempty"`
// Referer - Request referer.
Referer string `json:"referer,omitempty"`
// Form - Form values which are set as the request body.
Form url.Values `json:"form,omitempty"`
// Query - Form values which are set as the URL query.
Query url.Values `json:"query,omitempty"`
// Cookies - Request cookies.
Cookies url.Values `json:"cookies,omitempty"`
// Header - Header values.
Header map[string]string `json:"header,omitempty"`
// Raw - When set to false, an HTML response is expected and parsed.
Raw bool `json:"raw,omitempty"`
// Callbacks - Crawl callback list.
Callbacks []string `json:"callbacks,omitempty"`
}
Request - HTTP Request. Multipart form is not implemented.
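A sketch of a typical request and its conversion to a standard *http.Request; all field values are illustrative and the import path is an assumption:

import (
	"net/http"
	"net/url"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

// newSearchRequest builds a POST request with a form body, query values and a callback.
func newSearchRequest() (*http.Request, error) {
	req := &crawl.Request{
		URL:       "/search", // relative URL, resolved against Referer
		Referer:   "https://example.com/",
		Method:    "POST",
		Form:      url.Values{"q": {"golang"}}, // sent as the request body
		Query:     url.Values{"page": {"1"}},   // set as the URL query
		Header:    map[string]string{"X-Requested-With": "XMLHttpRequest"},
		Callbacks: crawl.Callbacks("search.results"),
	}
	return crawl.ConstructHTTPRequest(req)
}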
type RequestError ¶
RequestError - Crawl error.
func (*RequestError) Error ¶
func (err *RequestError) Error() string
Error - Returns request error message.
type Response ¶
Response - Crawl HTTP response. It is expected to be an HTML response, but this is not required. It ALWAYS has to be released using the Close() method.
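A sketch of executing a request directly and reading the parsed document. Note the unconditional Close(); it is also assumed that *Response satisfies the Finder interface used by Text, and the import path is an assumption:

import (
	"context"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

// fetchTitle executes a single request and returns the page title text.
func fetchTitle(ctx context.Context, c crawl.Crawler) (string, error) {
	resp, err := c.Execute(ctx, &crawl.Request{URL: "https://example.com/"})
	if resp != nil {
		defer resp.Close() // responses ALWAYS have to be released
	}
	if err != nil {
		return "", err
	}
	// Raw was false, so the body was parsed as HTML and Find/Text are usable.
	return crawl.Text(resp, "title"), nil
}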
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| examples | |
| imdb (command) | This is only an example, please don't harm imdb servers; if you need movies data, check out http://www.imdb.com/interfaces. I can also recommend checking out the source code of https://github.com/BurntSushi/goim, which implements importing data into SQL databases and comes with a command line search tool. |
| imdb/spider | Package spider implements imdb spider. |
| nsq-skeleton (command) | |
| forms | Package forms implements helpers for filling forms. |
| nsq | |
| consumer | Package consumer implements command line crawl consumer from nsq. |
| crawl-schedule (command) | |