oiwn / dom-content-extraction

DOM Based Content Extraction via Text Density

dom-content-extraction

Rust implementation of the paper by Fei Sun, Dandan Song, and Lejian Liao:

Content Extraction via Text Density (CETD)

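use dom_content_extraction::{DensityTree, get_node_text};
use scraper::Html;

// Assumption: `html` is a &str holding the page source.
let document = Html::parse_document(html);

// Build the density tree from the parsed document (&scraper::Html).
let dtree = DensityTree::from_document(&document);
// Take the last node of the sorted list as the main content candidate.
let sorted_nodes = dtree.sorted_nodes();
let node_id = sorted_nodes.last().unwrap().node_id;

println!("{}", get_node_text(node_id, &document));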
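
For readers unfamiliar with CETD, the sketch below illustrates the basic idea behind the density values used in the usage example above: a node's text density is the number of characters in its subtree divided by the number of tags in that subtree, with a tag count of zero treated as one. The `Node` type and its methods are hypothetical and exist only for this illustration. They are not this crate's API, and the sketch leaves out the link-aware refinements described in the paper.

```rust
// Toy DOM node for illustration only (hypothetical, not the crate's types).
struct Node {
    text: &'static str,  // text held directly by this node
    children: Vec<Node>, // child elements
}

impl Node {
    // A node with text and no child elements.
    fn leaf(text: &'static str) -> Self {
        Node { text, children: Vec::new() }
    }

    // A node with child elements and no text of its own.
    fn parent(children: Vec<Node>) -> Self {
        Node { text: "", children }
    }

    // Characters in this node plus all of its descendants.
    fn char_count(&self) -> usize {
        self.text.chars().count()
            + self.children.iter().map(Node::char_count).sum::<usize>()
    }

    // Number of descendant tags (the node itself is not counted).
    fn tag_count(&self) -> usize {
        self.children.iter().map(|c| 1 + c.tag_count()).sum()
    }

    // Basic text density: characters per tag, treating zero tags as one.
    fn text_density(&self) -> f64 {
        let tags = self.tag_count().max(1);
        self.char_count() as f64 / tags as f64
    }
}
```

Under this measure, a navigation block made of many short links scores low, while an article body with long runs of prose and few tags scores high, which is why the densest node is a reasonable candidate for the main content.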

Read the documentation on docs.rs.

About

License: GNU General Public License v3.0


Languages

Rust 86.8%, HTML 12.1%, Makefile 1.1%