parser

package
v0.5.0
Warning

This package is not in the latest version of its module.

Published: Mar 3, 2026 License: AGPL-3.0 Imports: 8 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func SimHash

func SimHash(text string) uint64

SimHash computes a 64-bit SimHash fingerprint of the visible text content. Near-duplicate pages yield fingerprints separated by a small Hamming distance. To compare two fingerprints a and b in ClickHouse, use bitCount(bitXor(a, b)).
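The same comparison can be done on the Go side. The following is a minimal sketch, not part of this package: hammingDistance and nearDuplicate are hypothetical helper names, the fingerprints are made-up values, and the threshold of 3 bits is an assumption chosen only for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the bit positions in which two 64-bit fingerprints
// differ. It mirrors ClickHouse's bitCount(bitXor(a, b)).
// Hypothetical helper, not exported by the parser package.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// nearDuplicate reports whether two fingerprints differ in at most maxBits
// positions. The right threshold depends on the corpus; it is an assumption here.
func nearDuplicate(a, b uint64, maxBits int) bool {
	return hammingDistance(a, b) <= maxBits
}

func main() {
	// Made-up fingerprints standing in for two SimHash results.
	a := uint64(0xF0F0F0F0F0F0F0F0)
	b := uint64(0xF0F0F0F0F0F0F0F3) // differs from a in the two lowest bits

	fmt.Println(hammingDistance(a, b))  // 2
	fmt.Println(nearDuplicate(a, b, 3)) // true
}
```

XOR leaves a 1 in every position where the fingerprints disagree, so counting the set bits of the result gives the Hamming distance directly.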

Types

type HreflangEntry

type HreflangEntry struct {
	Lang string
	URL  string
}

HreflangEntry represents an hreflang link.

type Image

type Image struct {
	Src    string
	Alt    string
	Width  string
	Height string
}

Image represents an image found on a page.

type Link

type Link struct {
	TargetURL  string
	AnchorText string
	Rel        string
	IsInternal bool
	Tag        string // "a", "link", "area", etc.
}

Link represents an extracted link from a page.

type PageData

type PageData struct {
	Title           string
	Canonical       string
	MetaRobots      string
	MetaDescription string
	MetaKeywords    string
	H1              []string
	H2              []string
	H3              []string
	H4              []string
	H5              []string
	H6              []string
	Links           []Link
	Images          []Image
	Hreflang        []HreflangEntry
	Lang            string
	OGTitle         string
	OGDescription   string
	OGImage         string
	SchemaTypes     []string
	WordCount       int
	ContentHash     uint64 // SimHash fingerprint of visible body text
	Resources       []PageResource
}

PageData holds all extracted SEO signals from a page.

func Parse

func Parse(body []byte, pageURL string) (*PageData, error)

Parse parses an HTML body and extracts its SEO signals into a PageData.

type PageResource

type PageResource struct {
	URL          string
	ResourceType string // "css", "js", "font", "icon"
	IsInternal   bool
}

PageResource represents a CSS, JS, font, or icon resource referenced by a page.

func ExtractResources

func ExtractResources(doc *goquery.Document, baseURL *url.URL) []PageResource

ExtractResources extracts external resource references (CSS, JS, fonts, icons) from the document.
