harrisaoz / FsHtmlKit

Transform HTML documents to plain text.

Purpose

Facilitate extraction of plain text from HTML documents.

Behaviour

FsHtmlKit uses FSharp.Data to parse the HTML document into a tree structure, then walks that tree with a bespoke visitor to extract the desired text elements. The simple façade, which is the primary means of using the library, behaves as follows:

  • Comments are ignored
  • CData nodes are ignored
  • Attributes other than href are ignored
  • Text nodes are passed through to the result
  • Elements that break a line when HTML is rendered emit a new line in the result
  • Runs of whitespace are collapsed into a single whitespace character
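
The rules above can be illustrated with a small, self-contained sketch. It is not the library's implementation: the `Node` type is a hypothetical stand-in for the tree that FSharp.Data produces, the set of line-breaking elements is an assumed subset, and the way an href is rendered (in angle brackets after the link text) is a guess made only for illustration.

```fsharp
open System.Text.RegularExpressions

// Hypothetical stand-in for a parsed HTML tree (not FSharp.Data's own types).
type Node =
    | Element of name: string * attributes: (string * string) list * children: Node list
    | Text of string
    | Comment of string
    | CData of string

// Assumed, incomplete set of elements that break a line when rendered.
let lineBreaking = set [ "br"; "p"; "div"; "li"; "tr"; "h1"; "h2"; "h3" ]

// Visit the tree depth-first, emitting text fragments in document order.
let rec visit node : string list =
    match node with
    | Comment _ | CData _ -> []   // comments and CData are ignored
    | Text t -> [ t ]             // text is passed through
    | Element (name, attrs, children) ->
        // Only href survives; every other attribute is dropped.
        // Rendering it as " <url>" is an assumption of this sketch.
        let href =
            attrs
            |> List.tryFind (fun (key, _) -> key = "href")
            |> Option.map (fun (_, value) -> sprintf " <%s>" value)
            |> Option.toList
        let body = List.collect visit children
        let lineEnd = if lineBreaking.Contains name then [ "\n" ] else []
        body @ href @ lineEnd

// Collapse runs of spaces and tabs to one space, leaving emitted new lines alone.
let collapse (s: string) = Regex.Replace(s, @"[ \t]+", " ")

let toPlainText root = visit root |> String.concat "" |> collapse

// Usage: a comment, a paragraph with extra spaces, and a link with two attributes.
let paragraph = Element ("p", [], [ Text "Hello,    world" ])
let link = Element ("a", [ ("href", "https://example.com"); ("class", "x") ], [ Text "link" ])
let doc = Element ("div", [], [ Comment "nav"; paragraph; link ])

printfn "%s" (toPlainText doc)
```

Keeping the traversal (`visit`) separate from the whitespace clean-up (`collapse`) means each rule in the list maps to one small piece of code, which is one reasonable way to structure a visitor of this kind.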

Usage

#r "nuget: FsHtmlKit"
open FsHtmlKit.Html2PlainText

let inputHtml = System.IO.File.ReadAllText "input.html"

// Placeholder handlers: replace each ... with your own code.
let ``Do something if extraction fails`` () = ...
let ``Do something with the plain text`` plainText = ...
let ``Do something with text that doesn't care whether the result is html or plain text`` text = ...

// Option-returning form: None signals that extraction failed.
tryExtractPlainText inputHtml |> function
    | None -> ``Do something if extraction fails`` ()
    | Some plainText -> ``Do something with the plain text`` plainText

// Non-optional form: always yields text, which (going by the handler name) may be HTML or plain text.
html2Text inputHtml
|> ``Do something with text that doesn't care whether the result is html or plain text``

Build (for Release)

dotnet build -c Release

Package

dotnet pack -c Release

Deploy

dotnet nuget push FsHtmlKit\bin\Release\FsHtmlKit.<version>.nupkg -s <github-source-name> -k <github-package-deployment-api-key>
