scrappy-core-0.1.0.1: html pattern matching library and high-level interface concurrent requests lib for webscraping
Safe Haskell: None
Language: Haskell2010

Scrappy.Links

Description

Scraping follows a loop: DOM -> Link >>= request -> DOM -> Link -> ... In practice this loop can be complicated arbitrarily by things such as JavaScript.

Scraping is recursive by nature, and the URL is the central data structure of that recursion.

This suggests there may be more to consider at some point, such as using the modern-uri package and building site trees.
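
The DOM -> Link -> request loop can be sketched as a worklist traversal. This is only a minimal sketch, not this module's implementation: `crawl`, `Dom`, and the two function arguments are hypothetical names, and the request is passed in as a parameter so the example runs without a network.

```haskell
import qualified Data.Set as Set

-- Hypothetical stand-ins; the library's own Url and DOM types may differ.
type Url = String
type Dom = String

-- | Visit pages breadth-first, following links until no unseen URL remains.
-- 'fetch' stands in for the HTTP request, 'links' for link extraction.
crawl :: Monad m => (Url -> m Dom) -> (Dom -> [Url]) -> Url -> m [Url]
crawl fetch links start = go Set.empty [start]
  where
    go _ [] = pure []
    go seen (u : queue)
      | u `Set.member` seen = go seen queue -- already visited, skip it
      | otherwise = do
          dom <- fetch u
          rest <- go (Set.insert u seen) (queue ++ links dom)
          pure (u : rest)
```

The set of seen URLs is what keeps the recursion finite when pages link back to each other.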

Documentation

type Src = Url

getHtmlStateful :: Url -> String

Could store the last URL in state.

fixURL :: LastUrl -> Href -> Url

Generic algorithm for determining the full URL of an href, given the last URL visited.
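
One way such a resolution step might look is sketched below. It is not the body of fixURL: `resolveHref` and `splitOrigin` are hypothetical helpers, all three types are assumed to be plain strings, and the last URL is assumed to be absolute (containing "://"). Segments such as "../" and query strings are not handled.

```haskell
import Data.List (isPrefixOf)

-- Assumed to be plain strings here; the library's types may differ.
type LastUrl = String
type Href = String
type Url = String

-- | Split "https://host/path" into ("https://host", "/path").
-- Assumes the input contains "://".
splitOrigin :: Url -> (String, String)
splitOrigin url =
  let (scheme, rest) = break (== ':') url
      (host, path) = break (== '/') (drop 3 rest)
   in (scheme ++ "://" ++ host, path)

-- | Resolve an href against the page it was found on.
resolveHref :: LastUrl -> Href -> Url
resolveHref lastUrl href
  | "http://" `isPrefixOf` href || "https://" `isPrefixOf` href = href
  | "//" `isPrefixOf` href = scheme ++ ":" ++ href -- protocol-relative
  | "/" `isPrefixOf` href = origin ++ href -- root-relative
  | otherwise = origin ++ dir ++ href -- relative to the last page's directory
  where
    (origin, path) = splitOrigin lastUrl
    scheme = takeWhile (/= ':') lastUrl
    -- Everything up to and including the final '/' of the last path.
    dir = case reverse (dropWhile (/= '/') (reverse path)) of
      "" -> "/"
      d -> d
```

For example, `resolveHref "https://example.com/docs/page.html" "img/a.png"` yields `"https://example.com/docs/img/a.png"`.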

deriveBaseUrl :: Link -> Maybe BaseUrl

The fromJust should never be reached if Links are used properly.
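
For comparison, a total variant that avoids fromJust altogether could look like the sketch below. `baseUrlOf` is a hypothetical name, `Link` is redeclared locally over String, and `BaseUrl` is assumed to be a plain string; only http and https links are recognised.

```haskell
import Data.Foldable (asum)
import Data.List (stripPrefix)

-- Local stand-ins for the module's types (assumed shapes).
newtype Link = Link String deriving (Eq, Show)
type BaseUrl = String

-- | Keep only scheme and host; Nothing when the link is not an
-- absolute http(s) URL with a non-empty host.
baseUrlOf :: Link -> Maybe BaseUrl
baseUrlOf (Link url) = asum (map try ["https://", "http://"])
  where
    try scheme = do
      rest <- stripPrefix scheme url
      let host = takeWhile (`notElem` "/?#") rest
      if null host then Nothing else Just (scheme ++ host)
```

Returning Nothing for malformed input keeps the failure visible in the type instead of relying on every caller using Links properly.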

mkBaseUrl :: URI -> Maybe Link

This seems fine as is, though rewriting it with lenses might simplify it (and would be good lens practice).

class IsLink a where

Methods

renderLink :: a -> Url
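
As an illustration of how the class might be used, the sketch below redeclares it locally so the example is self-contained. `Url` is assumed to be a String, and `SearchPage` is a hypothetical type that is not part of this module.

```haskell
-- Local redeclarations (assumed shapes) so the sketch compiles on its own.
type Url = String

class IsLink a where
  renderLink :: a -> Url

newtype Link = Link Url

instance IsLink Link where
  renderLink (Link url) = url

-- Hypothetical: a paginated search result page.
data SearchPage = SearchPage
  { searchBase :: Url
  , pageNumber :: Int
  }

instance IsLink SearchPage where
  renderLink (SearchPage base n) = base ++ "?page=" ++ show n
```

Any structured description of a page can then be turned into a requestable URL through the one method.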

Instances

doiParser :: forall s u (m :: Type -> Type). ParsecT s u m DOI

data ReferenceSys

Constructors

RefSys [String] [String] 

type Namespace = Text

Name and Namespace are really the same thing and might be merged. Refers literally to the "name" attribute.

type Option = Text

This is an operationally focused type: a given namespace is found to have n Options.

data QParams

Mostly for display and reasoning at the moment; not optimal.

Constructors

Opt (Map Namespace [Option]) 
SimpleKV (Text, Text) 
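
A sketch of how these constructors might be populated and consumed follows. The type declarations mirror the ones documented here, but `groupOptions` and `queryStrings` are hypothetical helpers, and the idea of enumerating one query string per combination of options is an assumption about intended use. No URL encoding is applied.

```haskell
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Text (Text)
import qualified Data.Text as T

type Namespace = Text
type Option = Text

data QParams
  = Opt (Map Namespace [Option])
  | SimpleKV (Text, Text)
  deriving (Eq, Show)

-- | Group observed (name, option) pairs by name, keeping the order
-- in which the options were seen.
groupOptions :: [(Namespace, Option)] -> QParams
groupOptions pairs =
  Opt (Map.fromListWith (flip (++)) [(n, [o]) | (n, o) <- pairs])

-- | One query string per way of choosing a single option for every namespace.
queryStrings :: QParams -> [Text]
queryStrings (SimpleKV (k, v)) = [k <> T.pack "=" <> v]
queryStrings (Opt m) =
  [ T.intercalate (T.pack "&") choice
  | choice <- mapM (\(n, os) -> [n <> T.pack "=" <> o | o <- os]) (Map.toList m)
  ]
```

With two options for "color" and one for "size", `queryStrings` produces two strings, one per colour.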

type SiteTree = [(Bool, Text)]

URLs within the site, each paired with whether it has been checked for some pattern.
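
A few operations one might want over this representation are sketched below. `addUrl`, `nextUnchecked`, and `markChecked` are hypothetical helpers, not exports of this module.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

type SiteTree = [(Bool, Text)]

-- | Add a URL as unchecked unless it is already present.
addUrl :: Text -> SiteTree -> SiteTree
addUrl url tree
  | any ((== url) . snd) tree = tree
  | otherwise = tree ++ [(False, url)]

-- | First URL that has not been checked yet, if any.
nextUnchecked :: SiteTree -> Maybe Text
nextUnchecked = lookup False

-- | Flag a URL as checked, leaving all other entries untouched.
markChecked :: Text -> SiteTree -> SiteTree
markChecked url = map (\(done, u) -> (done || u == url, u))
```

Because the Bool comes first in each pair, `lookup False` finds the next unchecked URL directly.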

data DOMLink

This would not need to be exported, since our interfaces would implement it under the hood and return a Link.

Constructors

Href' Href 
Src Url 
PlainLink Url 

newtype Link

Constructors

Link Url 

Instances

parseLink :: Bool -> Link -> Url -> Maybe Link

This is a general interface for extracting a raw link from scraped content according to the scraper's own constraints, e.g. whether it must stay 100% on the same site.

maybeUsefulNewUrl :: Link -> [(Link, a)] -> Link -> Maybe Link

Core function of the module. Keeps only links that point to other pages on the current site and have not yet been found over the course of scraping it; filters out URLs such as https://othersite.com and "#".
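
The two conditions (same site, not yet found) can be sketched as follows. This is an illustration under assumptions, not the library's code: `usefulAndNew` and `onSite` are hypothetical names, `Link` is redeclared locally over String, and the first argument is assumed to be the base URL of the current site.

```haskell
import Data.List (stripPrefix)

-- Local stand-in for the module's Link (assumed shape).
newtype Link = Link String deriving (Eq, Show)

-- | True when the URL starts with the base and the base ends at a
-- boundary, so "https://example.com.evil.org" does not count.
onSite :: Link -> Link -> Bool
onSite (Link base) (Link url) =
  case stripPrefix base url of
    Just "" -> True
    Just (c : _) -> c `elem` "/?#"
    Nothing -> False

-- | Keep a candidate only if it is on the base site and not yet recorded.
usefulAndNew :: Link -> [(Link, a)] -> Link -> Maybe Link
usefulAndNew base seen candidate
  | not (onSite base candidate) = Nothing -- other site, or a bare "#"
  | candidate `elem` map fst seen = Nothing -- already found
  | otherwise = Just candidate
```

The second component of each pair in the seen list is ignored, which matches the polymorphic `a` in the documented signature.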

urlIsNew :: [(a, Url)] -> HrefURI -> Bool

maybeUsefulUrl :: Link -> Link -> Maybe Link

Filters out javascript refs, in-page DOM refs, URLs with query strings, and URLs that do not contain the base URL of the host site.
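
The four filters listed above could be expressed as a chain of guards, as in this sketch. `usefulUrl` is a hypothetical name, `Link` is redeclared locally over String, and treating "in-page DOM refs" as hrefs that begin with "#" is an assumption.

```haskell
import Data.List (isInfixOf, isPrefixOf)

-- Local stand-in for the module's Link (assumed shape).
newtype Link = Link String deriving (Eq, Show)

-- | Return the candidate only if it passes every filter.
usefulUrl :: Link -> Link -> Maybe Link
usefulUrl (Link base) candidate@(Link url)
  | "javascript:" `isPrefixOf` url = Nothing -- javascript ref
  | "#" `isPrefixOf` url = Nothing -- in-page DOM ref
  | '?' `elem` url = Nothing -- has a query string
  | not (base `isInfixOf` url) = Nothing -- does not contain the base URL
  | otherwise = Just candidate
```

Each guard rejects one category, so a link survives only when none of them applies.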

usefulNewUrls :: Link -> [(Link, a)] -> [Link] -> [Maybe Link]

Input is meant to be right from