scrappy-core-0.1.0.1: HTML pattern-matching library and high-level concurrent-requests interface for web scraping
Safe Haskell: None
Language: Haskell2010

Scrappy.Elem.ITextElemParser

Description

This will eventually be a beautiful interface between NLP and scrappy

Documentation

emptyTree :: forall a s (m :: Type -> Type) u. (ShowHTML a, Stream s m Char) => Maybe [Elem] -> Maybe (ParsecT s u m a) -> [(String, Maybe String)] -> ParsecT s u m (TreeHTML a) Source #

This comment is meant to crash Nix as a reminder to move the following somewhere sensible:

    data OpenStruct a = OpenStruct (Parser a)
    type CloseStruct a = OpenStruct a -> ClosePiece a
    -- could even be _ -> f, when the close struct is independent of the open one;
    -- I don't think this would affect speed
    data ClosePiece a = ClosePiece (Parser a)

Paired with maybeUsefulNewUrls, this would allow us to scrape an entire site for a single pattern. By virtue of basic Haskell types, there is no reason we can't have some simple type:

    data Scrapeable = Case1 A | Case2 B ...

    fanExistential :: Url -> (Url -> Bool) -> MaybeT m a -> MaybeT m [a]
    fanExistential url = do
      html <- getHtmlST sv url
      links <- flip successesM html $ hoistMaybe $ scrape (hrefParser' cond)
      fanExistential links

preface :: forall s (m :: Type -> Type) u pre a. Stream s m Char => ParsecT s u m pre -> ParsecT s u m a -> ParsecT s u m a Source #

class Zero a where Source #

Returns a minimum of 2, almost like same. This should be a function, same :: a -> [a], to be applied to some doc/String. Note: not sure if this exists, but here is where we could handle iterating over the names of attributes. Can generalize to ElementRep e.

Methods

consumeZero :: a -> b -> b Source #

class Singleton a where Source #

Methods

consumeSingleton :: a -> b Source #

class Multiple a where Source #

Methods

consumeMultiple :: a -> b Source #

class (Zero a, Singleton a, Multiple a) => Existential a where Source #

Methods

consumeExists :: a -> b Source #

emptyTreeGroup :: forall a s (m :: Type -> Type) u. (ShowHTML a, Stream s m Char) => Maybe [Elem] -> Maybe (ParsecT s u m a) -> [(String, Maybe String)] -> ParsecT s u m [TreeHTML a] Source #

Only matches if there are no innerTrees. This doesn't behave exactly like a "group" function, because it allows matching on a single element, but the result will never be empty.

elemAny :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m (Elem' String) Source #

data Paragraph Source #

TODO(galen): these should build off each other

Constructors

Paragraph 

Fields

data Sentence Source #

Constructors

Sentence 

Fields

Instances

Monoid Sentence Source #

Semigroup Sentence Source #

Show Sentence Source #

ShowHTML Sentence Source #

All defined in Scrappy.Elem.ITextElemParser

data WrittenWord Source #

Constructors

WW 

Fields

punctuation :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m Char Source #

writtenWord :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m WrittenWord Source #

"Word" also means bits, but written words are meant here specifically. This can definitely be expanded to increase its reach while maintaining validity.
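To illustrate what a "written word" parser might look like, here is a minimal sketch using plain Parsec. WrittenWordSketch, writtenWordSketch, and demoWord are hypothetical names, not part of scrappy, and the real WW field and matching rules are not shown in these docs. The sketch assumes a word is a run of letters with optional internal apostrophes or hyphens, and it uses the concrete Parser type rather than the general ParsecT signature for brevity.

```haskell
import Text.Parsec (ParseError, letter, many, many1, oneOf, parse, try)
import Text.Parsec.String (Parser)

-- Hypothetical stand-in for WrittenWord; the real field is not documented here.
newtype WrittenWordSketch = WWSketch String deriving (Eq, Show)

-- Assumption: letters, optionally followed by groups such as "'t" or "-known".
-- `try` lets a trailing apostrophe or hyphen be left unconsumed.
writtenWordSketch :: Parser WrittenWordSketch
writtenWordSketch = do
  start <- many1 letter
  rest  <- many (try ((:) <$> oneOf "'-" <*> many1 letter))
  pure (WWSketch (concat (start : rest)))

-- Parses only the first word of the input and stops at the space.
demoWord :: Either ParseError WrittenWordSketch
demoWord = parse writtenWordSketch "" "don't stop"
```

Widening the set of accepted internal characters is one way such a parser could be "expanded upon", at the cost of admitting more non-words.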

wordSeparator :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String Source #

comma :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String Source #

colon :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String Source #

semiColon :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String Source #

word' :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String Source #

capitalizedWord :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String Source #

number :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String Source #

sentence :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m Sentence Source #

sentenceWhere :: forall s (m :: Type -> Type) u. Stream s m Char => ([WrittenWord] -> Bool) -> ParsecT s u m Sentence Source #
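The predicate argument of sentenceWhere suggests a parse-then-filter pattern. The following is a rough sketch of that pattern under simplifying assumptions: words are plain Strings rather than WrittenWord, a sentence is space-separated words ending in a period, and sentenceWordsSketch, sentenceWhereSketch, demoAccepted, and demoRejected are hypothetical names. It is not the library's implementation.

```haskell
import Text.Parsec (ParseError, char, letter, many1, parse, sepBy1, try, unexpected)
import Text.Parsec.String (Parser)

-- Assumption: a sentence is one or more words separated by single spaces,
-- terminated by '.'.
sentenceWordsSketch :: Parser [String]
sentenceWordsSketch = (many1 letter `sepBy1` char ' ') <* char '.'

-- Accept the sentence only when the predicate holds for its words.
-- `try` makes a rejected sentence fail without consuming input.
sentenceWhereSketch :: ([String] -> Bool) -> Parser String
sentenceWhereSketch ok = try $ do
  ws <- sentenceWordsSketch
  if ok ws
    then pure (unwords ws ++ ".")
    else unexpected "sentence rejected by predicate"

demoAccepted, demoRejected :: Either ParseError String
demoAccepted = parse (sentenceWhereSketch ((>= 3) . length)) "" "the cat sat."
demoRejected = parse (sentenceWhereSketch ((>= 4) . length)) "" "the cat sat."
```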

sentenceTail :: forall s (m :: Type -> Type) u. Stream s m Char => Bool -> ParsecT s u m [WrittenWord] Source #

for research: new concept: reliable generalizations of thinking

styleTags :: [String] Source #

To my understanding this should not affect how we parse; the only sure given is that the result of our low-level read is really just words, so the parsers should focus on setting up the next parser.

This is built in a way that allows the idea of a sentence to be as internally valid as possible; the sentence controls the period.

    mkParagraph :: [Sentence] -> Paragraph
    mkParagraph ss = Paragraph . mkParagraph' $ ss
      where
        mkParagraph' :: [Sentence] -> String
        mkParagraph' ((Sentence s):[]) = s <> ('\n':[])
        mkParagraph' ((Sentence s):ss) = s <> " " <> (mkParagraph' ss)
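A self-contained variant of the mkParagraph idea from the comment above might look like the sketch below. The Sentence and Paragraph newtypes are stand-ins that assume a single String field (the real fields are not shown in these docs), the final character is assumed to be a newline, and the empty-list case is an addition that the commented-out code does not have.

```haskell
-- Stand-in types: assumed to wrap a String, as the commented-out code implies.
newtype Sentence = Sentence String deriving (Eq, Show)
newtype Paragraph = Paragraph String deriving (Eq, Show)

-- Join sentences with single spaces and end the paragraph with a newline.
mkParagraph :: [Sentence] -> Paragraph
mkParagraph = Paragraph . go
  where
    go :: [Sentence] -> String
    go []                = ""          -- assumption: an empty paragraph is empty text
    go [Sentence s]      = s <> "\n"
    go (Sentence s : ss) = s <> " " <> go ss

demoParagraph :: Paragraph
demoParagraph = mkParagraph [Sentence "One.", Sentence "Two."]
```

Because each Sentence already carries its own terminating period, the joiner only has to supply spacing.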

Note: will need a more complex accumulator for the case where an elem has two distinct text segments broken up by an element (rare case).

negParseOpeningTag :: forall s (m :: Type -> Type) u. Stream s m Char => [Elem] -> ParsecT s u m (Elem, Attrs) Source #

Matches only elements that are not in the given list.
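A "negative" opening-tag match can be sketched as parsing any opening tag and then rejecting excluded names. The version below is a simplified illustration, not scrappy's code: negOpeningTagSketch, demoAllowed, and demoExcluded are hypothetical names, element names are plain Strings, and the attributes are returned as raw text rather than as the library's Attrs type.

```haskell
import Text.Parsec (ParseError, alphaNum, anyChar, char, many1, manyTill, parse, try, unexpected)
import Text.Parsec.String (Parser)

-- Parse "<name ...>" and fail when the name is in the excluded list.
-- `try` ensures a rejected tag leaves the input untouched.
negOpeningTagSketch :: [String] -> Parser (String, String)
negOpeningTagSketch excluded = try $ do
  _     <- char '<'
  name  <- many1 alphaNum
  attrs <- manyTill anyChar (char '>')   -- raw attribute text (simplification)
  if name `elem` excluded
    then unexpected ("excluded element: " ++ name)
    else pure (name, attrs)

demoAllowed, demoExcluded :: Either ParseError (String, String)
demoAllowed  = parse (negOpeningTagSketch ["script"]) "" "<div class=\"x\">"
demoExcluded = parse (negOpeningTagSketch ["div"]) "" "<div class=\"x\">"
```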

textChunk :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String Source #

openOrCloseTag :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m () Source #

Matches any opening or closing element tag that is not a style tag.

anyEndTag :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m Char Source #

anyThingbut :: forall s (m :: Type -> Type) u. Stream s m Char => [String] -> ParsecT s u m String Source #

Despite the fun name, this is just for textChunk use
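One common way to express "anything but these strings" in Parsec is manyTill combined with lookAhead. The sketch below is an assumption about the general shape, not the library's definition. anythingButSketch and demoChunk are hypothetical names, and stopping at end of input is a choice made here for convenience.

```haskell
import Text.Parsec (ParseError, anyChar, choice, eof, lookAhead, manyTill, parse, string, try, (<|>))
import Text.Parsec.String (Parser)

-- Consume characters until one of the stop strings (or end of input) is next.
-- The stop string itself is not consumed.
anythingButSketch :: [String] -> Parser String
anythingButSketch stops = manyTill anyChar (lookAhead stopHere)
  where
    stopHere = (() <$ choice (map (try . string) stops)) <|> eof

demoChunk :: Either ParseError String
demoChunk = parse (anythingButSketch ["<"]) "" "hello <b>world</b>"
```

Leaving the stop string unconsumed lets a following parser, such as a tag parser, pick up exactly where the text chunk ended.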

textChunkIf :: forall s (m :: Type -> Type) u. Stream s m Char => (String -> Bool) -> ParsecT s u m String Source #

plainText :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String Source #

styleElem :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m (Elem' String) Source #

catEithers :: [Either e a] -> [a] Source #
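Judging only from its type and its name (by analogy with catMaybes), catEithers presumably keeps the Right values and drops the Left ones, which is what Data.Either.rights does. The sketch below shows that reading. catEithersSketch and demoRights are hypothetical names and the behaviour is an assumption, since the docs do not describe it.

```haskell
-- Keep every Right payload, in order; Left values are discarded.
catEithersSketch :: [Either e a] -> [a]
catEithersSketch es = [a | Right a <- es]

demoRights :: [Int]
demoRights = catEithersSketch [Left "skipped", Right 1, Left "also skipped", Right 2]
```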

divideUp :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String -> ParsecT s u m [Either String String] Source #

onlyPlainText :: forall s (m :: Type -> Type) u. Stream s m Char => ParsecT s u m String Source #

data AccumITextElem a Source #

Constructors

ACT [String]