HandsomeSoup

Work with HTML more easily in HXT

https://github.com/egonSchiele/HandsomeSoup

LTS Haskell 22.30:	0.4.2
Stackage Nightly 2024-07-26:	0.4.2
Latest on Hackage:	0.4.2

BSD-3-Clause licensed by Aditya Bhargava

Maintained by [email protected]

This version can be pinned in stack with: HandsomeSoup-0.4.2@sha256:dcee6dce4637129d0b6cf4533dec22fb2f04115ecf70e7f43549be7941399b85,2172

Module documentation for 0.4.2

Text
- Text.CSS
  - Text.CSS.Parser
- Text.HandsomeSoup

Depends on 11 packages:

base, containers, HandsomeSoup, HTTP, hxt, hxt-http, mtl, network, network-uri, parsec, transformers

Used by 1 package in lts-7.24:

HandsomeSoup

Current Status: Usable and stable. Needs GHC 7.6. Please file bugs!

HandsomeSoup is the library I wish I had when I started parsing HTML in Haskell.

It is built on top of HXT and adds a few functions that make it easier to work with HTML.

Most importantly, it adds CSS selectors to HXT. The goal of HandsomeSoup is to be a complete CSS2 selector parser for HXT.

Install

cabal install HandsomeSoup

Example

Nokogiri, the HTML parser for Ruby, has an example showing how to scrape Google search results. This is easy in HandsomeSoup:

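import Text.XML.HXT.Core
import Text.HandsomeSoup

main :: IO ()
main = do
    -- fromUrl builds an arrow that fetches and parses the page when run
    let doc = fromUrl "http://www.google.com/search?q=egon+schiele"
    -- select the result links and pull out each href attribute
    links <- runX $ doc >>> css "h3.r a" ! "href"
    mapM_ putStrLn links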

What can HandsomeSoup do for you?

Easily parse an online page using `fromUrl`

let doc = fromUrl "http://example.com"

Or a local page using `parseHtml`

contents <- readFile [filename]
let doc = parseHtml contents
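
To see `parseHtml` work end to end without a network connection, here is a minimal sketch. The `sampleHtml` string and the `linksIn` helper are hypothetical names introduced for illustration; the string stands in for whatever `readFile` would return.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical sample markup, standing in for the contents of a local file.
sampleHtml :: String
sampleHtml =
  "<html><body>\
  \<a href=\"/one\">One</a>\
  \<a href=\"/two\">Two</a>\
  \</body></html>"

-- Hypothetical helper: parse an HTML string and collect the href
-- attribute of every <a> element found by the CSS selector.
linksIn :: String -> IO [String]
linksIn html = runX $ parseHtml html >>> css "a" ! "href"
```

From `main` or GHCi, `linksIn sampleHtml >>= mapM_ putStrLn` prints one link per line. Nothing is parsed until `runX` executes the arrow, so `parseHtml` can be bound with a plain `let`.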

Easily extract elements using `css`

Here are some valid selectors:

doc >>> css "a"
doc >>> css "*"
doc >>> css "a#link1"
doc >>> css "a.foo"
doc >>> css "p > a"
doc >>> css "p strong"
doc >>> css "#container h1"
doc >>> css "img[width]"
doc >>> css "img[width=400]"
doc >>> css "a[class~=bar]"
doc >>> css "a:first-child"
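
A selector on its own yields matching elements as HXT trees. To get at their text you can combine `css` with HXT's `//>` and `getText`. The sketch below assumes a small hypothetical document (`sampleDoc`) and a hypothetical helper (`textOf`); neither is part of HandsomeSoup.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical sample markup used only for illustration.
sampleDoc :: String
sampleDoc =
  "<html><body><div id=\"container\">\
  \<h1>Title</h1>\
  \<p><a class=\"foo\" href=\"/a\">first</a><strong>bold</strong></p>\
  \</div></body></html>"

-- Hypothetical helper: run a CSS selector over an HTML string and
-- return every text node found beneath the matched elements.
textOf :: String -> String -> IO [String]
textOf selector html = runX $ parseHtml html >>> css selector //> getText
```

For example, `textOf "p strong" sampleDoc` collects the text inside every `strong` element that sits under a `p`.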

Easily get attributes using `(!)`

doc >>> css "img" ! "src"
doc >>> css "a" ! "href"
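
`(!)` returns only the attribute value. If you want the attribute together with something else from the same element, one option is to drop down to HXT's own `getAttrValue` and pair arrows with `&&&`. This is a sketch under that assumption; `sampleLinks` and `linkPairs` are hypothetical names.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical sample markup used only for illustration.
sampleLinks :: String
sampleLinks =
  "<html><body>\
  \<a href=\"/one\">One</a>\
  \<a href=\"/two\">Two</a>\
  \</body></html>"

-- Hypothetical helper: for each <a> element, pair its href attribute
-- with the text it contains.
linkPairs :: String -> IO [(String, String)]
linkPairs html =
  runX $ parseHtml html >>> css "a" >>> (getAttrValue "href" &&& deep getText)
```

Because `css` produces ordinary HXT arrows, it composes with the rest of the HXT combinators in the same way.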

Docs

Find Haddock docs on Hackage.

I also wrote The Complete Guide To Parsing HXT With Haskell.

Credits

Made by Adit.
