Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
Text.HTML.Scalpel
Description
Scalpel is a web scraping library inspired by libraries like parsec and Perl's Web::Scraper. Scalpel builds on top of Text.HTML.TagSoup to provide a declarative and monadic interface.
There are two general mechanisms provided by this library that are used to build web scrapers: Selectors and Scrapers.
Selectors describe a location within an HTML DOM tree. The simplest selector,
that can be written is a simple string value. For example, the selector
"div"
matches every single div node in a DOM. Selectors can be combined
using tag combinators. The //
operator to define nested relationships
within a DOM tree. For example, the selector "div" // "a"
matches all
anchor tags nested arbitrarily deep within a div tag.
In addition to describing the nested relationships between tags, selectors
can also include predicates on the attributes of a tag. The @:
operator
creates a selector that matches a tag based on the name and various
conditions on the tag's attributes. An attribute predicate is just a function
that takes an attribute and returns a boolean indicating if the attribute
matches a criteria. There are several attribute operators that can be used
to generate common predicates. The @=
operator creates a predicate that
matches the name and value of an attribute exactly. For example, the selector
"div" @: ["id" @= "article"]
matches div tags where the id
attribute is equal to "article"
.
Scrapers are values that are parameterized over a selector and produce
a value from an HTML DOM tree. The Scraper
type takes two type parameters.
The first is the string like type that is used to store the text values
within a DOM tree. Any string like type supported by Text.StringLike is
valid. The second type is the type of value that the scraper produces.
There are several scraper primitives that take selectors and extract content from the DOM. Each primitive defined by this library comes in two variants: singular and plural. The singular variants extract the first instance matching the given selector, while the plural variants match every instance.
The following is an example that demonstrates most of the features provided
by this library. Suppose you have the following hypothetical HTML located at
"http:/example.comarticle.html"
and you would like to extract a list of
all of the comments.
<html> <body> <div class='comments'> <div class='comment container'> <span class='comment author'>Sally</span> <div class='comment text'>Woo hoo!</div> </div> <div class='comment container'> <span class='comment author'>Bill</span> <img class='comment image' src='http://example.com/cat.gif' /> </div> <div class='comment container'> <span class='comment author'>Susan</span> <div class='comment text'>WTF!?!</div> </div> </div> </body> </html>
The following snippet defines a function, allComments
, that will download
the web page, and extract all of the comments into a list:
type Author = String data Comment = TextComment Author String | ImageComment Author URL deriving (Show, Eq) allComments :: IO (Maybe [Comment]) allComments =scrapeURL
"http://example.com/article.html" comments where comments :: Scraper String [Comment] comments =chroots
("div"@:
[hasClass
"container"]) comment comment :: Scraper String Comment comment = textComment<|>
imageComment textComment :: Scraper String Comment textComment = do author <-text
$ "span" @: [hasClass "author"] commentText <- text $ "div" @: [hasClass "text"] return $ TextComment author commentText imageComment :: Scraper String Comment imageComment = do author <- text $ "span" @: [hasClass "author"] imageURL <-attr
"src" $ "img" @: [hasClass "image"] return $ ImageComment author imageURL
Complete examples can be found in the examples folder in the scalpel git repository.
Synopsis
- data Selector
- data AttributePredicate
- data AttributeName
- data TagName
- tagSelector :: String -> Selector
- textSelector :: Selector
- anySelector :: Selector
- (//) :: Selector -> Selector -> Selector
- atDepth :: Selector -> Int -> Selector
- (@:) :: TagName -> [AttributePredicate] -> Selector
- (@=) :: AttributeName -> String -> AttributePredicate
- (@=~) :: RegexLike re String => AttributeName -> re -> AttributePredicate
- hasClass :: String -> AttributePredicate
- notP :: AttributePredicate -> AttributePredicate
- match :: (String -> String -> Bool) -> AttributePredicate
- type Scraper str = ScraperT str Identity
- data ScraperT str (m :: Type -> Type) a
- attr :: forall str (m :: Type -> Type). (Show str, StringLike str, Monad m) => String -> Selector -> ScraperT str m str
- attrs :: forall str (m :: Type -> Type). (Show str, StringLike str, Monad m) => String -> Selector -> ScraperT str m [str]
- html :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m str
- htmls :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m [str]
- innerHTML :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m str
- innerHTMLs :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m [str]
- text :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m str
- texts :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m [str]
- chroot :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => Selector -> ScraperT str m a -> ScraperT str m a
- chroots :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => Selector -> ScraperT str m a -> ScraperT str m [a]
- position :: forall str (m :: Type -> Type). (StringLike str, Monad m) => ScraperT str m Int
- matches :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m ()
- scrape :: StringLike str => Scraper str a -> [Tag str] -> Maybe a
- scrapeStringLike :: StringLike str => str -> Scraper str a -> Maybe a
- scrapeT :: StringLike str => ScraperT str m a -> [Tag str] -> m (Maybe a)
- scrapeStringLikeT :: (StringLike str, Monad m) => str -> ScraperT str m a -> m (Maybe a)
- type URL = String
- fetchTags :: StringLike str => URL -> IO [Tag str]
- fetchTagsWithConfig :: StringLike str => Config str -> URL -> IO [Tag str]
- scrapeURL :: StringLike str => URL -> Scraper str a -> IO (Maybe a)
- scrapeURLWithConfig :: StringLike str => Config str -> URL -> Scraper str a -> IO (Maybe a)
- data Config str = Config {}
- type Decoder str = Response ByteString -> str
- defaultDecoder :: StringLike str => Decoder str
- utf8Decoder :: StringLike str => Decoder str
- iso88591Decoder :: StringLike str => Decoder str
- type SerialScraper str a = SerialScraperT str Identity a
- data SerialScraperT str (m :: Type -> Type) a
- inSerial :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => SerialScraperT str m a -> ScraperT str m a
- stepNext :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m a
- stepBack :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m a
- seekNext :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m a
- seekBack :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m a
- untilNext :: forall str (m :: Type -> Type) a b. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m b -> SerialScraperT str m b
- untilBack :: forall str (m :: Type -> Type) a b. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m b -> SerialScraperT str m b
Selectors
Selector
defines a selection of an HTML DOM tree to be operated on by
a web scraper. The selection includes the opening tag that matches the
selection, all of the inner tags, and the corresponding closing tag.
Instances
IsString Selector | |
Defined in Text.HTML.Scalpel.Internal.Select.Types Methods fromString :: String -> Selector # |
data AttributePredicate #
An AttributePredicate
is a method that takes a Attribute
and
returns a Bool
indicating if the given attribute matches a predicate.
data AttributeName #
The AttributeName
type can be used when creating Selector
s to specify
the name of an attribute of a tag.
Constructors
AnyAttribute | |
AttributeString String |
Instances
IsString AttributeName | |
Defined in Text.HTML.Scalpel.Internal.Select.Types Methods fromString :: String -> AttributeName # |
Instances
IsString TagName | |
Defined in Text.HTML.Scalpel.Internal.Select.Types Methods fromString :: String -> TagName # |
tagSelector :: String -> Selector #
A selector which will match all text nodes.
Wildcards
anySelector :: Selector #
A selector which will match any node (including tags and bare text).
Tag combinators
Attribute predicates
(@:) :: TagName -> [AttributePredicate] -> Selector infixl 9 #
The @:
operator creates a Selector
by combining a TagName
with a list
of AttributePredicate
s.
(@=) :: AttributeName -> String -> AttributePredicate infixl 6 #
The @=
operator creates an AttributePredicate
that will match
attributes with the given name and value.
If you are attempting to match a specific class of a tag with potentially
multiple classes, you should use the hasClass
utility function.
(@=~) :: RegexLike re String => AttributeName -> re -> AttributePredicate infixl 6 #
The @=~
operator creates an AttributePredicate
that will match
attributes with the given name and whose value matches the given regular
expression.
hasClass :: String -> AttributePredicate #
The classes of a tag are defined in HTML as a space separated list given by
the class
attribute. The hasClass
function will match a class
attribute
if the given class appears anywhere in the space separated list of classes.
notP :: AttributePredicate -> AttributePredicate #
Negates an AttributePredicate
.
match :: (String -> String -> Bool) -> AttributePredicate #
The match
function allows for the creation of arbitrary
AttributePredicate
s. The argument is a function that takes the attribute
key followed by the attribute value and returns a boolean indicating if the
attribute satisfies the predicate.
Scrapers
data ScraperT str (m :: Type -> Type) a #
Instances
Primitives
attr :: forall str (m :: Type -> Type). (Show str, StringLike str, Monad m) => String -> Selector -> ScraperT str m str #
attrs :: forall str (m :: Type -> Type). (Show str, StringLike str, Monad m) => String -> Selector -> ScraperT str m [str] #
The attrs
function takes an attribute name and a selector and returns the
value of the attribute of the given name for every opening tag
(possibly nested) that matches the given selector.
s = "<div id=\"out\"><div id=\"in\"></div></div>" scrapeStringLike s (attrs "id" "div") == Just ["out", "in"]
html :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m str #
htmls :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m [str] #
The htmls
function takes a selector and returns the html string from
every set of tags (possibly nested) matching the given selector.
s = "<div><div>A</div></div>" scrapeStringLike s (htmls "div") == Just ["<div><div>A</div></div>", "<div>A</div>"]
innerHTML :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m str #
The innerHTML
function takes a selector and returns the inner html string
from the set of tags described by the given selector. Inner html here meaning
the html within but not including the selected tags.
This function will match only the first set of tags matching the selector, to
match every set of tags, use innerHTMLs
.
innerHTMLs :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m [str] #
The innerHTMLs
function takes a selector and returns the inner html
string from every set of tags (possibly nested) matching the given selector.
s = "<div><div>A</div></div>" scrapeStringLike s (innerHTMLs "div") == Just ["<div>A</div>", "A"]
text :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m str #
texts :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m [str] #
The texts
function takes a selector and returns the inner text from every
set of tags (possibly nested) matching the given selector.
s = "<div>Hello <div>World</div></div>" scrapeStringLike s (texts "div") == Just ["Hello World", "World"]
chroot :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => Selector -> ScraperT str m a -> ScraperT str m a #
The chroot
function takes a selector and an inner scraper and executes
the inner scraper as if it were scraping a document that consists solely of
the tags corresponding to the selector.
This function will match only the first set of tags matching the selector, to
match every set of tags, use chroots
.
chroots :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => Selector -> ScraperT str m a -> ScraperT str m [a] #
The chroots
function takes a selector and an inner scraper and executes
the inner scraper as if it were scraping a document that consists solely of
the tags corresponding to the selector. The inner scraper is executed for
each set of tags (possibly nested) matching the given selector.
s = "<div><div>A</div></div>" scrapeStringLike s (chroots "div" (pure 0)) == Just [0, 0]
position :: forall str (m :: Type -> Type). (StringLike str, Monad m) => ScraperT str m Int #
The position
function is intended to be used within the do-block of a
chroots
call. Within the do-block position will return the index of the
current sub-tree within the list of all sub-trees matched by the selector
passed to chroots
.
For example, consider the following HTML:
<article> <p> First paragraph. </p> <p> Second paragraph. </p> <p> Third paragraph. </p> </article>
The position
function can be used to determine the index of each <p>
tag
within the article
tag by doing the following.
chroots "article" // "p" $ do index <- position content <- text "p" return (index, content)
Which will evaluate to the list:
[ (0, "First paragraph.") , (1, "Second paragraph.") , (2, "Third paragraph.") ]
matches :: forall str (m :: Type -> Type). (StringLike str, Monad m) => Selector -> ScraperT str m () #
The matches
function takes a selector and returns ()
if the selector
matches any node in the DOM.
Executing scrapers
scrapeStringLike :: StringLike str => str -> Scraper str a -> Maybe a #
The scrapeStringLike
function parses a StringLike
value into a list of
tags and executes a Scraper
on it.
scrapeStringLikeT :: (StringLike str, Monad m) => str -> ScraperT str m a -> m (Maybe a) #
The scrapeStringLikeT
function parses a StringLike
value into a list of
tags and executes a ScraperT
on it. Since ScraperT
is a monad
transformer, the result is monadic.
fetchTags :: StringLike str => URL -> IO [Tag str] Source #
Download and parse the contents of the given URL.
fetchTagsWithConfig :: StringLike str => Config str -> URL -> IO [Tag str] Source #
Download and parse the contents of the given URL with the given Config
.
scrapeURL :: StringLike str => URL -> Scraper str a -> IO (Maybe a) Source #
The scrapeURL
function downloads the contents of the given URL and
executes a Scraper
on it.
The default behavior is to use the global manager provided by
http-client-tls (via getGlobalManager
). Any exceptions thrown by
http-client are not caught and are bubbled up to the caller.
scrapeURLWithConfig :: StringLike str => Config str -> URL -> Scraper str a -> IO (Maybe a) Source #
The scrapeURLWithConfig
function takes a Config
record type and
downloads the contents of the given URL and executes a Scraper
on it.
A record type that determines how scrapeURLWithConfig
interacts with the
HTTP server and interprets the results.
Instances
StringLike str => Default (Config str) Source # | |
Defined in Text.HTML.Scalpel.Internal.Scrape.URL |
type Decoder str = Response ByteString -> str Source #
A method that takes a HTTP response as raw bytes and returns the body as a string type.
defaultDecoder :: StringLike str => Decoder str Source #
The default response decoder. This decoder attempts to infer the character set of the HTTP response body from the `Content-Type` header. If this header is not present, then the character set is assumed to be `ISO-8859-1`.
utf8Decoder :: StringLike str => Decoder str Source #
A decoder that will always decode using `UTF-8`.
iso88591Decoder :: StringLike str => Decoder str Source #
A decoder that will always decode using `ISO-8859-1`.
Serial Scraping
type SerialScraper str a = SerialScraperT str Identity a #
A SerialScraper
allows for the application of Scraper
s on a sequence of
sibling nodes. This allows for use cases like targeting the sibling of a
node, or extracting a sequence of sibling nodes (e.g. paragraphs (<p>)
under a header (<h2>)).
Conceptually serial scrapers operate on a sequence of tags that correspond to the immediate children of the currently focused node. For example, given the following HTML:
<article> <h1>title</h1> <h2>Section 1</h2> <p>Paragraph 1.1</p> <p>Paragraph 1.2</p> <h2>Section 2</h2> <p>Paragraph 2.1</p> <p>Paragraph 2.2</p> </article>
A serial scraper that visits the header and paragraph nodes can be executed with the following:
chroot
"article" $inSerial
$ do ...
Each SerialScraper
primitive follows the pattern of first moving the focus
backward or forward and then extracting content from the new focus.
Attempting to extract content from beyond the end of the sequence causes the
scraper to fail.
To complete the above example, the article's structure and content can be extracted with the following code:
chroot
"article" $inSerial
$ do title <-seekNext
$text
"h1" sections <- many $ do section <-seekNext
$ text "h2" ps <-untilNext
(matches
"h2") (many $seekNext
$text
"p") return (section, ps) return (title, sections)
Which will evaluate to:
("title", [ ("Section 1", ["Paragraph 1.1", "Paragraph 1.2"]), ("Section 2", ["Paragraph 2.1", "Paragraph 2.2"]), ])
data SerialScraperT str (m :: Type -> Type) a #
Run a serial scraper transforming over a monad m
.
Instances
inSerial :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => SerialScraperT str m a -> ScraperT str m a #
Executes a SerialScraper
in the context of a Scraper
. The immediate
children of the currently focused node are visited serially.
Primitives
stepNext :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m a #
Move the cursor forward one node and execute the given scraper on the new focused node.
stepBack :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m a #
Move the cursor back one node and execute the given scraper on the new focused node.
seekNext :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m a #
Move the cursor forward until the given scraper is successfully able to execute on the focused node. If the scraper is never successful then the serial scraper will fail.
seekBack :: forall str (m :: Type -> Type) a. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m a #
Move the cursor backward until the given scraper is successfully able to execute on the focused node. If the scraper is never successful then the serial scraper will fail.
untilNext :: forall str (m :: Type -> Type) a b. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m b -> SerialScraperT str m b #
Create a new serial context by moving the focus forward and collecting nodes until the scraper matches the focused node. The serial scraper is then executed on the collected nodes.
The provided serial scraper is unable to see nodes outside the new restricted context.
untilBack :: forall str (m :: Type -> Type) a b. (StringLike str, Monad m) => ScraperT str m a -> SerialScraperT str m b -> SerialScraperT str m b #
Create a new serial context by moving the focus backward and collecting nodes until the scraper matches the focused node. The serial scraper is then executed on the collected nodes.