Documentation ¶
Index ¶
- Variables
- func Attr(n Finder, attr, selector string) string
- func Callbacks(v ...string) []string
- func ConstructHTTPRequest(req *Request) (r *http.Request, err error)
- func FindAny(finder Finder, selectors ...string) (node *goquery.Selection)
- func NodeAttr(attr string) func(int, *goquery.Selection) string
- func NodeResolveURL(resp *Response) func(int, *goquery.Selection) string
- func NodeText(_ int, n *goquery.Selection) string
- func ParseFloat(n Finder, selector string) (res float64, err error)
- func ParseUint(n Finder, selector string) (res uint64, err error)
- func ProxyFromContext(ctx context.Context) (addrs []string, ok bool)
- func Text(n Finder, selector string) string
- func WithProxy(ctx context.Context, addrs ...string) context.Context
- func WriteResponseFile(r *Response, fname string) (err error)
- type Crawler
- type Finder
- type Handler
- type Job
- type Middleware
- type Option
- func WithConcurrency(n int) Option
- func WithDefaultHeaders(headers map[string]string) Option
- func WithDefaultTimeout(d time.Duration) Option
- func WithQueue(queue Queue) Option
- func WithQueueCapacity(n int) Option
- func WithSpiders(spiders ...func(Crawler)) Option
- func WithTransport(transport *http.Transport) Option
- func WithUserAgent(ua string) Option
- type Queue
- type Request
- type RequestError
- type Response
- func (r *Response) Bytes() (body []byte, err error)
- func (r *Response) Close() error
- func (r *Response) Find(selector string) *goquery.Selection
- func (r *Response) ParseHTML() (err error)
- func (r *Response) Query() *goquery.Document
- func (r *Response) Status() string
- func (r *Response) URL() *url.URL
Constants ¶
This section is empty.
Variables ¶
var DefaultHeaders = map[string]string{
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.8",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36",
}
DefaultHeaders - Default crawler headers.
var NodeDataPhoto = NodeAttr("data-photo")
NodeDataPhoto - Node "data-photo" attribute selector.
var NodeHref = NodeAttr("href")
NodeHref - Node "href" attribute selector.
var NodeSrc = NodeAttr("src")
NodeSrc - Node "src" attribute selector.
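NodeAttr and the selectors derived from it return functions with the signature func(int, *goquery.Selection) string, which matches the callback accepted by goquery's (*Selection).Map. A minimal sketch of collecting every link href from a parsed response; the crawl import path is assumed, not stated on this page:

import (
	crawl "github.com/crackcomm/crawl" // assumed import path
)

// collectHrefs returns the href attribute of every matching anchor.
func collectHrefs(resp *crawl.Response) []string {
	// NodeHref is NodeAttr("href"); it plugs straight into goquery's Map.
	return resp.Find("a[href]").Map(crawl.NodeHref)
}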
Functions ¶
func ConstructHTTPRequest ¶
ConstructHTTPRequest - Constructs an http.Request structure.
func NodeResolveURL ¶
NodeResolveURL - Returns a selector which takes an href and resolves its URL. Returns a helper for (*goquery.Selection).Each().
func ParseFloat ¶
ParseFloat - Finds a node in the response and parses its text as a float64. When the text is not found, it returns 0.0 and a nil error. The returned error originates from strconv.ParseFloat.
func ParseUint ¶
ParseUint - Finds a node in the response and parses its text as a uint64. When the text is not found, it returns 0 and a nil error. The returned error originates from strconv.ParseUint.
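A short sketch of both parsers inside a scraping helper. The selectors are hypothetical, and *Response is assumed to satisfy the Finder interface since it provides the Find method listed above; the crawl import path is assumed:

import (
	crawl "github.com/crackcomm/crawl" // assumed import path
)

// productNumbers extracts a price and a stock count from a product page.
// Both helpers return the zero value with a nil error when the selector matches nothing.
func productNumbers(resp *crawl.Response) (price float64, stock uint64, err error) {
	if price, err = crawl.ParseFloat(resp, ".price"); err != nil {
		return 0, 0, err // error originates from strconv.ParseFloat
	}
	if stock, err = crawl.ParseUint(resp, ".stock"); err != nil {
		return 0, 0, err // error originates from strconv.ParseUint
	}
	return price, stock, nil
}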
func ProxyFromContext ¶
ProxyFromContext - Returns proxy addresses from context metadata.
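WithProxy stores proxy addresses in context metadata and ProxyFromContext reads them back; how the crawler applies them when executing requests is not described on this page. A minimal sketch of scheduling a request with proxy metadata attached; the import path and proxy address are assumptions:

import (
	"context"
	"log"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

// scheduleWithProxy attaches a proxy address to the context before scheduling.
func scheduleWithProxy(c crawl.Crawler, req *crawl.Request) error {
	ctx := crawl.WithProxy(context.Background(), "http://127.0.0.1:8118")
	if addrs, ok := crawl.ProxyFromContext(ctx); ok {
		log.Printf("scheduling %q with proxies %v", req.URL, addrs)
	}
	return c.Schedule(ctx, req)
}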
func WriteResponseFile ¶
WriteResponseFile - Reads the response body into memory and writes it to a file.
Types ¶
type Crawler ¶
type Crawler interface {
// Schedule - Schedules request.
// Context is passed to queue in a job.
Schedule(context.Context, *Request) error
// Execute - Makes an HTTP request respecting the context deadline.
// If the request's Raw field is not true, the ParseHTML() method is executed on the Response.
// Then all callbacks are executed with the context.
Execute(context.Context, *Request) (*Response, error)
// Handlers - Returns all registered handlers.
Handlers() map[string][]Handler
// Register - Registers crawl handler.
Register(name string, h Handler)
// Middleware - Registers a middleware.
// Request is not executed if middleware returns an error.
Middleware(Middleware)
// Start - Starts the crawler.
// All errors should be received from Errors() channel.
Start()
// Close - Closes the queue and the crawler.
Close() error
// Errors - Returns a channel that receives all crawl errors.
// Only errors from queued requests are delivered here,
// including queue errors as well as request errors.
Errors() <-chan error
}
Crawler - Crawler interface.
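The Handler type is not expanded on this page; the sketch below assumes it is a function of the form func(context.Context, *Response) error. Under that assumption, a spider is simply a function that registers handlers on the crawler; the selector, callback names, and import path are hypothetical:

import (
	"context"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

// Spider registers handlers; pass it to the crawler with WithSpiders(Spider).
func Spider(c crawl.Crawler) {
	c.Register("example.list", func(ctx context.Context, resp *crawl.Response) error {
		// Resolve every listed link against the response URL and schedule
		// it for the "example.detail" callback.
		for _, u := range resp.Find("a.item").Map(crawl.NodeResolveURL(resp)) {
			req := &crawl.Request{URL: u, Callbacks: crawl.Callbacks("example.detail")}
			if err := c.Schedule(ctx, req); err != nil {
				return err
			}
		}
		return nil
	})
}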
type Job ¶
type Job interface {
// Request - Returns the crawl request.
Request() *Request
// Context - Returns job context.
Context() context.Context
// Done - Sets job as done.
Done()
}
Job - Crawl job interface.
type Middleware ¶
Middleware - Crawler middleware.
type Option ¶
type Option func(*crawl)
Option - Crawl option.
func WithConcurrency ¶
WithConcurrency - Sets crawl concurrency. Default: 1000.
func WithDefaultHeaders ¶
WithDefaultHeaders - Sets crawl default headers. Default: empty.
func WithDefaultTimeout ¶
WithDefaultTimeout - Sets default request timeout duration.
func WithQueue ¶
WithQueue - Sets crawl queue. Default: creates a queue using NewQueue() with the capacity set by WithQueueCapacity().
func WithQueueCapacity ¶
WithQueueCapacity - Sets the queue capacity. It is used when a new in-memory channel queue has to be created, and it also sets the capacity of the errors buffered channel. Default: 10000.
func WithSpiders ¶
WithSpiders - Registers spiders on a crawler.
func WithTransport ¶
WithTransport - Sets crawl HTTP transport.
func WithUserAgent ¶
WithUserAgent - Sets crawl default user-agent.
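A constructor is not listed in this section; the sketch below assumes one named New that accepts these options (the name and import path are hypothetical) and that Errors() should be drained while the crawler runs:

import (
	"log"
	"time"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

func run() {
	c := crawl.New( // hypothetical constructor accepting Option values
		crawl.WithConcurrency(200),
		crawl.WithQueueCapacity(10000),
		crawl.WithDefaultTimeout(30*time.Second),
		crawl.WithUserAgent("my-crawler/1.0"),
		crawl.WithSpiders(Spider), // Spider from the Crawler sketch above
	)
	defer c.Close()

	// Drain crawl errors while the crawler runs.
	go func() {
		for err := range c.Errors() {
			log.Println("crawl error:", err)
		}
	}()

	c.Start()
}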
type Queue ¶
type Queue interface {
// Get - Gets a job from the queue.
// Returns io.EOF if the queue is empty.
Get() (Job, error)
// Schedule - Schedules a Request.
// Returns io.ErrClosedPipe if the queue is closed.
Schedule(context.Context, *Request) error
// Close - Closes the queue.
Close() error
}
Queue - Requests queue.
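The Get/Schedule/Close contract allows a simple worker loop. A sketch of draining a queue with a crawler, assuming each job should be executed and then marked done; the import path is an assumption:

import (
	"io"
	"log"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

// drain executes queued jobs until the queue reports io.EOF.
func drain(c crawl.Crawler, q crawl.Queue) error {
	for {
		job, err := q.Get()
		if err == io.EOF {
			return nil // queue is empty
		}
		if err != nil {
			return err
		}
		resp, err := c.Execute(job.Context(), job.Request())
		if resp != nil {
			resp.Close() // responses always have to be released
		}
		if err != nil {
			log.Println("execute:", err)
		}
		job.Done()
	}
}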
type Request ¶
type Request struct {
// URL - It can be an absolute URL, or relative to the source URL if Referer is set.
URL string `json:"url,omitempty"`
// Method - "GET" by default.
Method string `json:"method,omitempty"`
// Referer - Request referer.
Referer string `json:"referer,omitempty"`
// Form - Form values which are set as the request body.
Form url.Values `json:"form,omitempty"`
// Query - Form values which are set as the URL query.
Query url.Values `json:"query,omitempty"`
// Cookies - Request cookies.
Cookies url.Values `json:"cookies,omitempty"`
// Header - Header values.
Header map[string]string `json:"header,omitempty"`
// Raw - When set to false, an HTML response is expected and parsed.
Raw bool `json:"raw,omitempty"`
// Callbacks - Crawl callback list.
Callbacks []string `json:"callbacks,omitempty"`
}
Request - HTTP Request. Multipart form is not implemented.
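A sketch of a typical request and its conversion to a standard *http.Request; all field values are illustrative and the import path is an assumption:

import (
	"net/http"
	"net/url"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

// newSearchRequest builds a POST request with a form body, query values and a callback.
func newSearchRequest() (*http.Request, error) {
	req := &crawl.Request{
		URL:       "/search", // relative URL, resolved against Referer
		Referer:   "https://example.com/",
		Method:    "POST",
		Form:      url.Values{"q": {"golang"}}, // sent as the request body
		Query:     url.Values{"page": {"1"}},   // set as the URL query
		Header:    map[string]string{"X-Requested-With": "XMLHttpRequest"},
		Callbacks: crawl.Callbacks("search.results"),
	}
	return crawl.ConstructHTTPRequest(req)
}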
type RequestError ¶
RequestError - Crawl error.
func (*RequestError) Error ¶
func (err *RequestError) Error() string
Error - Returns request error message.
type Response ¶
Response - Crawl HTTP response. It is expected to be an HTML response, but this is not required. It ALWAYS has to be released using the Close() method.
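A sketch of executing a request directly and reading the parsed document. Note the unconditional Close(); it is also assumed that *Response satisfies the Finder interface used by Text, and the import path is an assumption:

import (
	"context"

	crawl "github.com/crackcomm/crawl" // assumed import path
)

// fetchTitle executes a single request and returns the page title text.
func fetchTitle(ctx context.Context, c crawl.Crawler) (string, error) {
	resp, err := c.Execute(ctx, &crawl.Request{URL: "https://example.com/"})
	if resp != nil {
		defer resp.Close() // responses ALWAYS have to be released
	}
	if err != nil {
		return "", err
	}
	// Raw was false, so the body was parsed as HTML and Find/Text are usable.
	return crawl.Text(resp, "title"), nil
}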
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| examples | |
| imdb (command) | This is only an example, please don't harm imdb servers; if you need movies data, check out http://www.imdb.com/interfaces. I can also recommend checking out the source code of https://github.com/BurntSushi/goim, which implements importing data into SQL databases and comes with a command line search tool. |
| imdb/spider | Package spider implements imdb spider. |
| nsq-skeleton (command) | |
| forms | Package forms implements helpers for filling forms. |
| nsq | |
| consumer | Package consumer implements command line crawl consumer from nsq. |
| crawl-schedule (command) | |