gospider

Version: v0.1.1
Published: Jun 18, 2024 License: MIT Imports: 16 Imported by: 0

README

Usage

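package main

// NOTE: replace "gospider" with the module's full import path.
import "gospider"

func main() {
    // Create a spider with 2 workers and start it in the background.
    spider := gospider.NewSpider(2)
    go spider.Run()

    // Producer: keep feeding requests into the request queue.
    go func() {
        for {
            request := gospider.Request{
                Url: "https://www.baidu.com",
            }
            spider.AddRequest(request)
        }
    }()

    // Consumer: GetResponse returns (Response, bool), not an error.
    // This assumes a false second value means no response is available.
    for {
        response, ok := spider.GetResponse()
        if !ok {
            break
        }
        // parse response here, e.g. extract the first matching href
        href := response.Xpath.ExtractFirst("//div[@class='mnav']/a/@href")
        _ = href
    }
}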

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Request

type Request struct {
	Url         string
	Proxy       string
	Cookie      []*http.Cookie
	MaxRetryNum int
	Meta        map[string]any
	Headers     map[string]string
}

Request describes a single request to be crawled.

type Response

type Response struct {
	Request    Request
	Error      error
	Content    string
	Meta       map[string]any
	StatusCode int
	Xpath      Xpath
}

Response holds the result of a request.

type Spider

type Spider struct {
	RequestQueue  chan Request
	ResponseQueue chan Response
	Client        *http.Client
	Transport     *http.Transport
	WorkerNum     int
	Stat          spiderStat
	// contains filtered or unexported fields
}

Spider is the crawler.
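The RequestQueue, ResponseQueue, and WorkerNum fields suggest a worker-pool design built on channels. The sketch below shows that general pattern using only the standard library. It is an assumption about the shape of the design, not the package's code: `job`, `result`, `runPool`, and the `fetch` callback are all hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// job and result are hypothetical stand-ins for Request and Response.
type job struct{ URL string }

type result struct {
	URL  string
	Body string
}

// runPool starts workerNum goroutines that read jobs from in, apply fetch,
// and send results on the returned channel. The channel is closed once
// in is closed and every worker has finished.
func runPool(workerNum int, in <-chan job, fetch func(string) string) <-chan result {
	out := make(chan result)
	var wg sync.WaitGroup
	for i := 0; i < workerNum; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range in {
				out <- result{URL: j.URL, Body: fetch(j.URL)}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	in := make(chan job, 3)
	for _, u := range []string{"a", "b", "c"} {
		in <- job{URL: u}
	}
	close(in)

	// A fake fetch keeps the example offline.
	out := runPool(2, in, func(u string) string { return "body of " + u })
	for r := range out {
		fmt.Println(r.URL, "->", r.Body)
	}
}
```

Results arrive in whatever order the workers finish, so a consumer should not rely on request order.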

func NewSpider

func NewSpider(worker int) Spider

NewSpider is the constructor for Spider.

func (*Spider) AddRequest

func (s *Spider) AddRequest(r Request) bool

AddRequest adds a new request to the request queue.

It returns the status of the task queue.

func (*Spider) Close

func (s *Spider) Close()

func (*Spider) GetResponse

func (s *Spider) GetResponse() (Response, bool)

GetResponse retrieves data from the response queue.

func (*Spider) RandTransport

func (s *Spider) RandTransport()

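RandTransport generates a random http.Transport for the Spider struct.

It sets the Transport's DisableKeepAlives field to true, which disables keep-alive connections.
It also sets the TLSClientConfig field: it skips TLS certificate verification, sets the TLS protocol version range, sets the cipher suites, and sets the client session cache size.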

func (*Spider) Run

func (s *Spider) Run()

Run starts the crawler.

func (*Spider) Signal added in v0.1.1

func (s *Spider) Signal()

Signal captures OS signals and handles them.

type Xpath added in v0.1.1

type Xpath struct {
	X *html.Node
}

Xpath parses HTML using XPath expressions.

func NewXpathParser added in v0.1.1

func NewXpathParser(content []byte) Xpath

NewXpathParser is the constructor for Xpath.

func (*Xpath) Extract added in v0.1.1

func (x *Xpath) Extract(s string) []string

Extract returns the text content of all matching nodes.

func (*Xpath) ExtractFirst added in v0.1.1

func (x *Xpath) ExtractFirst(s string) string

ExtractFirst returns the text content of the first matching node.

func (*Xpath) Xpath added in v0.1.1

func (x *Xpath) Xpath(s string) Xpath

Xpath returns the first matching node.

func (*Xpath) XpathList added in v0.1.1

func (x *Xpath) XpathList(s string) []Xpath

XpathList returns all matching nodes.
